Abstract

The question of aggregating pairwise comparisons to obtain a global ranking over a collection 
of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR's 
TrueSkill system) and chess players, aggregating social opinions, or deciding which product to 
sell based on transactions. In most settings, in addition to obtaining a ranking, finding 'scores' for
each object (e.g. player's rating) is of interest for understanding the intensity of the preferences. 

In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores 
for objects (or items) from pairwise comparisons. The algorithm has a natural random walk 
interpretation over the graph of objects with an edge present between a pair of objects if they 
are compared; the scores turn out to be the stationary probability of this random walk. The 
algorithm is model independent. To establish the efficacy of our method, however, we consider 
the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which 
determines the probabilistic outcomes of pairwise comparisons between objects. We bound the 
finite sample error rates between the scores assumed by the BTL model and those estimated 
by our algorithm. In particular, the number of samples required to learn the score well with 
high probability depends on the structure of the comparison graph. When the Laplacian of the 
comparison graph has a strictly positive spectral gap, e.g. each item is compared to a subset 
of randomly chosen items, this leads to order-optimal dependence on the number of samples. 
Experimental evaluations on synthetic datasets generated according to the BTL model show 
that our (model independent) algorithm performs as well as the Maximum Likelihood estimator 
for that model and outperforms a recently proposed algorithm by Ammar and Shah [AS11].

1 Introduction 

Rank aggregation is an important task in a wide range of learning and social contexts arising in 
recommendation systems, information retrieval, and sports and competitions. Given n items, we 
wish to infer relevancy scores or an ordering on the items based on partial orderings provided 
through many (possibly contradictory) samples. Frequently, the available data that is presented 
to us is in the form of a comparison: player A defeats player B; book A is purchased when books 
A and B are displayed (a bigger collection of books implies multiple pairwise comparisons); movie 
A is liked more compared to movie B. From such partial preferences in the form of comparisons, 
we frequently wish to deduce not only the order of the underlying objects, but also the scores 
associated with the objects themselves so as to deduce the intensity of the resulting preference 
order. 

For example, the Microsoft TrueSkill engine assigns scores to online gamers based on the outcomes
of (pairwise) games between players. Indeed, it assumes that each player has an inherent "skill"
and the outcomes of the games are used to learn these skill parameters, which in turn lead to scores
associated with each player. In most such settings, similar model-based approaches are employed. 

In this paper, we have set out with the following goal: develop an algorithm for the above stated 
problem which (a) is computationally simple, (b) works with available (comparison) data only and 
does not try to fit any model per se, (c) makes sense in general, and (d) if the data indeed obeys 
a reasonable model, then the algorithm should do as well as the best model aware algorithm. The 
main result of this paper is an affirmative answer to all these questions. 

Related work. Most rating based systems rely on users to provide explicit numeric scores for 
their interests. While these assumptions have led to a flurry of theoretical research for item recom- 
mendations based on matrix completion [CR09, KMO10, NW12], it is widely believed that numeric 
scores provided by individual users are generally inconsistent. Furthermore, in a number of learning 
contexts as illustrated above, it is simply impractical to ask a user to provide explicit scores. 

These observations have led to the need to develop methods that can aggregate such forms of 
ordering information into relevance ratings. In general, however, designing consistent aggregation 
methods can be challenging due in part to possible contradictions between individual preferences. 
For example, if we consider items A, B, and C, one user might prefer A to B, while another prefers 
B to C, and a third user prefers C to A. Such problems have been well studied, as in the work
by Condorcet [Con85]. In the celebrated work by Arrow [Arr63], the existence of a rank aggregation
algorithm with reasonable sets of properties (or axioms) was shown to be impossible.

In this paper, we are interested in a more restrictive setting: we have outcomes of pairwise 
comparisons between pairs of items, rather than a complete ordering as considered in [Arr63].
Based on those pairwise comparisons, we want to obtain a ranking of items along with a score for 
each item indicating the intensity of the preference. One reasonable way to think about our setting 
is to imagine that there is a distribution over orderings or rankings or permutations of items and 
every time a pair of items is compared, the outcome is generated as per this underlying distribution. 
With this, our question becomes even harder than the setting considered by Arrow [Arr63] as, in
that work, effectively the entire distribution over permutations was already known! 

Indeed, such hurdles have not stopped the scientific community as well as practical designers 
from designing such systems. Chess rating systems and the more recent MSR TrueSkill Ranking 
system are prime examples. Our work falls precisely into this realm: design algorithms that work
well in practice, make sense in general, and perhaps more importantly, have attractive theoretical
properties under common comparative judgment models.

With this philosophy in mind, in recent work, Ammar and Shah [AS11] have presented an algorithm
that tries to achieve the goal with which we have set out. However, their algorithm requires
information about comparisons between all pairs, and for each pair it requires the exact pairwise
comparison 'marginal' with respect to the underlying distribution over permutations. In
reality, not all pairs of items can typically be compared, and the number of times each pair is
compared is also very small. Therefore, while an important step is taken in [AS11], it stops short
of achieving the desired goal.

In somewhat related work by Braverman and Mossel [BM08], the authors present an algorithm 
that produces an ordering based on $O(n \log n)$ pair-wise comparisons on adaptively selected pairs.
They assume that there is an underlying true ranking and one observes noisy comparison results.
Each time a pair is queried, we are given the true ordering of the pair with probability $1/2 + \gamma$ for
some $\gamma > 0$ which does not depend on the items being compared. One limitation of this model is
that it does not capture the fact that in many applications, like chess matches, the outcome of a
comparison very much depends on the opponents that are competing.

Such considerations have naturally led to the study of noise models induced by parametric 
distributions over permutations. An important and landmark model in this class is called the 
Bradley-Terry-Luce (BTL) model [BT55, Luc59], which is also known as the Multinomial Logit
(MNL) model (cf. McFadden [McF73]). It has been the backbone of many practical system
designs including pricing in the airline industry [TR05]. Adler, Gemmell, Balter, Karp and Kenyon
[AGHB+94] used such models to design adaptive algorithms that select the winner from a small
number of rounds. Interestingly enough, the (near-)optimal performance of their adaptive algorithm
for winner selection is matched by our non-adaptive (model independent) algorithm for assigning
scores to obtain global rankings of all players.

Finally, earlier work by Dwork et al. [DKNS01] proposed a number of Markov chain based
methods for rank aggregation. The chain named MC3 in that work exactly corresponds to the
algorithm presented here. However, our derivation of the algorithm draws from the intuition arising
from the recursive update equation discussed in Section 2.2. Furthermore, in this work we provide
precise theoretical guarantees predicting the performance of the algorithm.

Our contributions. In this paper, we provide an iterative algorithm that takes the noisy compar- 
ison answers between a subset of all possible pairs of items as input and produces scores for each 
item as the output. The proposed algorithm has a nice intuitive explanation. Consider a graph 
with nodes/vertices corresponding to the items of interest (e.g. players). Construct a random walk 
on this graph where at each time, the random walk is likely to go from vertex i to vertex j if items i 
and j were ever compared; and if so, the likelihood of going from i to j depends on how often i lost 
to j. That is, the random walk is more likely to move to a neighbor who has more "wins". How 
frequently this walk visits a particular node in the long run, or equivalently the stationary distri- 
bution, is the score of the corresponding item. Thus, effectively this algorithm captures preference 
of the given item versus all of the others, not just immediate neighbors: the global effect induced 
by transitivity of comparisons is captured through the stationary distribution. 

Such an interpretation of the stationary distribution of a Markov chain or a random walk has
been an effective measure of relative importance of a node in a wide class of graph problems, popularly
known as network centrality [New10]. Notable examples of such network centralities include the
random surfer model on the web graph for the version of PageRank [BP98], which computes the
relative importance of a web page, and a model of a random crawler in a peer-to-peer file-sharing
network to assign a trust value to each peer in EigenTrust [KSGM03].

The computation of the stationary distribution of the Markov chain boils down to 'power
iteration' using the transition matrix, leading to a nice iterative algorithm. Thus, in effect, we have
produced an algorithm that (a) is computationally simple and iterative, (b) is model independent
and works with the data only, and (c) intuitively makes sense. To establish rigorous properties of
the algorithm, we analyze its performance under the BTL model described in Section 2.1.

Formally, we establish the following result: given $n$ items, when comparison results between
randomly chosen $O(n\,\mathrm{poly}(\log n))$ pairs of them are produced as per an (unknown) underlying BTL
model, the stationary distribution produced by our algorithm (asymptotically) matches the true
score (induced by the BTL model). It should be noted that $\Omega(n \log n)$ is a necessary number of
(random) comparisons for any algorithm to even produce a consistent ranking (due to the connectivity
threshold of a random graph). In that sense, up to a $\mathrm{poly}(\log n)$ factor, our algorithm is optimal in
terms of sample complexity.

In general, the comparisons may not be available between randomly chosen pairs. Let
$G = ([n], E)$ denote the graph of comparisons between these $n$ objects with an edge $(i, j) \in E$ if and
only if objects $i$ and $j$ are compared. In this setting, we establish that with $O(n\,\mathrm{poly}(\log n)\,\xi^{-2})$
comparisons, our algorithm learns the true score of the underlying BTL model. Here, $\xi$ is the
spectral gap for the Laplacian of $G$ and this is how the graph structure of comparisons plays a role.
Indeed, as a special case when comparisons are chosen at random, the induced graph is Erdős-Rényi
for which $\xi$ turns out to be constant, leading to the (order) optimal performance of the algorithm
as stated earlier.

To understand the performance of our algorithm compared to the other options, we perform an
empirical study. It shows that the performance of our algorithm is identical to the
ML estimation of the BTL model. Furthermore, it handsomely outperforms other popular choices
including the algorithm by [AS11].

Some remarks about our analytic technique. Our analysis boils down to studying the induced
stationary distribution of the random walk or Markov chain corresponding to the algorithm. As in
most such scenarios, the only hope of obtaining meaningful results for such a 'random noisy' Markov
chain is to relate it to the stationary distribution of a known Markov chain. Through recent
concentration of measure results for random matrices and a comparison technique using Dirichlet forms for
characterizing the spectrum of reversible/self-adjoint operators, along with the known expansion
property of the random graph, we obtain the eventual result. Indeed, it is these
powerful results that lead to near-optimal analytic results for the random comparison model and a
characterization of the algorithm's performance in the general setting.

The remainder of this paper is organized as follows. In Section 2 we will concretely introduce
our model, the problem, and our algorithm. In Section 3 we will discuss our main theoretical
results. The proofs will be presented in Section 4.

Notation. In the remainder of this paper, we use $C$, $C'$, etc. to denote absolute constants,
and their value might change from line to line. We use $A^T$ to denote the transpose of a matrix.
The Euclidean norm of a vector is denoted by $\|x\| = \sqrt{\sum_i x_i^2}$, and the operator norm of a linear
operator is denoted by $\|A\|_2 = \max_x x^T A x / x^T x$. When we say with high probability, we mean that
the probability of a sequence of events $A_n$ goes to one as $n$ grows: $\lim_{n \to \infty} \mathbb{P}(A_n) = 1$. Also define
$[n] = \{1, 2, \ldots, n\}$ to be the set of all integers from 1 to $n$.

2 Model, Problem Statement, and Algorithm 

We now present a concrete exposition of our underlying probabilistic model used in the analysis of 
our algorithm as well as a formal description of our problem. We then present our explicit random 
walk approach to ranking. 

2.1 Model definition 

In this section we discuss our model of comparisons between various items. As alluded to above, for 
the purpose of establishing analytic properties of the algorithm, we will assume that the outcome 
of a specific comparison is governed by the BTL model and that we perform the same number 
of comparisons for each pair that we select. However, the algorithm itself is model-independent
and operates with data generated in an arbitrary manner. Namely, the algorithm does not need to
estimate model parameters of the BTL model in order to perform the analysis.

Bradley-Terry-Luce model for comparative judgment. When comparing pairs of items
from $n$ items of interest, represented as $[n] = \{1, \ldots, n\}$, the Bradley-Terry-Luce model assumes
that there is a weight score $w_i \in \mathbb{R}_+$ (i.e. it is a strictly positive real number) associated with each
item $i \in [n]$. The outcome of a comparison for a pair of items $i$ and $j$ is determined only by the
corresponding weights $w_i$ and $w_j$. Let $Y_{ij}^l$ denote the outcome of the $l$-th comparison of the pair
$i$ and $j$, such that $Y_{ij}^l = 1$ if $j$ is preferred over $i$ and $0$ otherwise. Then, according to the BTL
model,

$$Y_{ij}^l = \begin{cases} 1 & \text{with probability } \frac{w_j}{w_i + w_j}, \\ 0 & \text{otherwise.} \end{cases}$$

Furthermore, conditioned on the score vector $w = \{w_i\}$, it is assumed that the random variables
$Y_{ij}^l$ are independent of one another for all $i$, $j$, and $l$.
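
For readers who want to experiment, the following is a minimal sketch of how comparison outcomes could be simulated under this model. The function name `sample_btl_outcomes`, the seed, and the example scores are illustrative assumptions and do not come from the paper.

```python
import random

def sample_btl_outcomes(w, i, j, k, rng):
    """Draw k independent BTL outcomes for the pair (i, j).

    Each outcome is 1 if j is preferred over i, which happens with
    probability w[j] / (w[i] + w[j]), and 0 otherwise.
    """
    p_j_wins = w[j] / (w[i] + w[j])
    return [1 if rng.random() < p_j_wins else 0 for _ in range(k)]

rng = random.Random(0)           # fixed seed, for reproducibility only
w = [1.0, 2.0, 4.0]              # hypothetical strictly positive scores
outcomes = sample_btl_outcomes(w, 0, 2, 10000, rng)
frac = sum(outcomes) / len(outcomes)   # empirical frequency, near 4 / (1 + 4)
```

Multiplying every entry of `w` by the same positive constant leaves `p_j_wins` unchanged, which is the scale invariance of the model.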

Since the BTL model is invariant under the scaling of the scores, an $n$-dimensional representation
of the scores is not unique. Indeed, under the BTL model, a score vector is the equivalence class
$[w] = \{ w' \in \mathbb{R}^n \mid w' = a w, \text{ for some } a > 0 \}$. The outcome of a comparison only depends on the
equivalence class of the score vector.

To get a unique representation, we represent each equivalence class by its projection onto the
standard orthogonal simplex such that $\sum_i w_i = 1$. This representation naturally defines a distance
between two equivalence classes as the Euclidean distance between two projections:

$$d(w, w') = \left\| \frac{w}{\sum_i w_i} - \frac{w'}{\sum_i w'_i} \right\| .$$



Our main result provides an upper bound on the (normalized) distance between the estimated score 
vector and the true underlying score vector. 
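
The distance between equivalence classes described above can be sketched in a few lines. The helper names `project_to_simplex` and `class_distance` are hypothetical, and the sketch assumes strictly positive score vectors of equal length.

```python
import math

def project_to_simplex(w):
    """Rescale a strictly positive score vector so that its entries sum to one."""
    total = sum(w)
    return [x / total for x in w]

def class_distance(w1, w2):
    """Euclidean distance between the simplex projections of two score vectors."""
    p1, p2 = project_to_simplex(w1), project_to_simplex(w2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

# Two vectors in the same equivalence class are at distance (numerically) zero.
same_class = class_distance([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# Vectors from different classes are separated.
different = class_distance([1.0, 1.0], [1.0, 3.0])
```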

Sampling model. We also assume that we perform a fixed number $k$ of comparisons for all
pairs $i$ and $j$ that are considered (e.g. a best of $k$ series). This assumption is mainly to simplify
notation, and the analysis as well as the algorithm easily generalize to the case when we might have
a different number of comparisons for different pairs. Given observations of pairwise comparisons
among $n$ items according to this sampling model, we define a comparisons graph $G = ([n], E, A)$ as
a graph of $n$ items where two items are connected if we have comparison data on that pair and $A$
denotes the weights on each of the edges in $E$.

2.2 Random walk approach to ranking 

In our setting, we will assume that $a_{ij}$ represents the fraction of times object $j$ has been preferred to
object $i$, for example the fraction of times chess player $j$ has defeated player $i$. Given the notation
above we have that $a_{ij} = (1/k) \sum_{l=1}^{k} Y_{ij}^l$. Consider a random walk on a weighted directed graph
$G = ([n], E, A)$, where a pair $(i, j) \in E$ if and only if the pair has been compared. The edge weights
are defined based on the outcome of the comparisons: $A_{ij} = a_{ij} / (a_{ij} + a_{ji})$ and $A_{ji} = a_{ji} / (a_{ij} + a_{ji})$
(note that $a_{ij} + a_{ji} = 1$ in our setting). We let $A_{ij} = 0$ if the pair has not been compared. Note
that by the Strong Law of Large Numbers, as the number $k \to \infty$ the quantity $A_{ij}$ converges to
$w_j / (w_i + w_j)$ almost surely.
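
As a small illustration of these definitions, the sketch below turns raw 0/1 outcomes into the edge weights. The function name `edge_weights` and the input layout (a dictionary keyed by pairs) are assumptions made for this example only.

```python
def edge_weights(n, outcomes):
    """Compute the weight matrix A from comparison outcomes.

    outcomes maps a pair (i, j) to a non-empty list of 0/1 results,
    where 1 means that j was preferred over i.
    """
    A = [[0.0] * n for _ in range(n)]      # pairs never compared keep weight 0
    for (i, j), ys in outcomes.items():
        a_ij = sum(ys) / len(ys)           # fraction of times j was preferred to i
        a_ji = 1.0 - a_ij                  # fraction of times i was preferred to j
        A[i][j] = a_ij / (a_ij + a_ji)
        A[j][i] = a_ji / (a_ij + a_ji)
    return A

# Pair (0, 1): item 1 wins 3 of 4 comparisons. Pair (1, 2): item 1 wins both.
A = edge_weights(3, {(0, 1): [1, 1, 0, 1], (1, 2): [0, 0]})
```

With a fixed $k$ per pair the denominator `a_ij + a_ji` is always one; it is kept in the code only to mirror the formula.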




A random walk can be represented by a time-independent transition matrix $P$, where
$P_{ij} = \mathbb{P}(X_{t+1} = j \mid X_t = i)$. By definition, the entries of a transition matrix are non-negative and satisfy
$\sum_j P_{ij} = 1$. One way to define a valid transition matrix of a random walk on $G$ is to scale all the
edge weights by $1/d_{\max}$, where we define $d_{\max}$ as the maximum out-degree of a node. This rescaling
ensures that each row-sum is at most one. Finally, to ensure that each row-sum is exactly one, we
add a self-loop to each node. Concretely,

$$P_{ij} = \begin{cases} \frac{1}{d_{\max}} A_{ij} & \text{if } i \neq j, \\ 1 - \frac{1}{d_{\max}} \sum_{k \neq i} A_{ik} & \text{if } i = j. \end{cases} \qquad (1)$$
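
The construction just described (rescale by $1/d_{\max}$, then add self-loops) can be sketched as follows. The function name `transition_matrix` is hypothetical, and the sketch assumes that at least one pair has been compared, so that $d_{\max} > 0$.

```python
def transition_matrix(A):
    """Build a row-stochastic matrix P from the edge weights A.

    A[i][j] + A[j][i] equals 1 for compared pairs and 0 otherwise, so a
    positive sum marks an edge of the comparison graph.
    """
    n = len(A)
    degree = [sum(1 for j in range(n) if j != i and A[i][j] + A[j][i] > 0)
              for i in range(n)]
    d_max = max(degree)
    P = [[A[i][j] / d_max if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        # The self-loop absorbs whatever probability mass is left in row i.
        P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)
    return P

# Illustrative weights: pairs (0, 1) and (1, 2) were compared, (0, 2) was not.
A = [[0.00, 0.75, 0.00],
     [0.25, 0.00, 0.00],
     [0.00, 1.00, 0.00]]
P = transition_matrix(A)
```

Here node 1 has the largest degree (two neighbours), so every off-diagonal weight is halved before the self-loops are filled in.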



The choice to construct our random walk as above is not arbitrary. In an ideal setting with infinite
samples ($k \to \infty$) per comparison the transition matrix $P$ would define a reversible Markov chain
under the BTL model. Recall that a Markov chain is reversible if it satisfies the detailed balance
equation: there exists $v \in \mathbb{R}^n_+$ such that $v_i P_{ij} = v_j P_{ji}$ for all $i, j$; and in that case, $\pi \in \mathbb{R}^n_+$ defined
as $\pi_i = v_i / \sum_j v_j$ is a stationary distribution. In the ideal setting (say $k \to \infty$), we would
have $P_{ij} = \tilde{P}_{ij} = (1/d_{\max})\, w_j / (w_i + w_j)$. That is, the random walk will move from state $i$ to state $j$
with probability proportional to the chance that item $j$ is preferred to item $i$. In such a setting, it
is clear that $v = w$ satisfies the reversibility conditions. Therefore, under these ideal conditions it
immediately follows that the vector $w / \sum_i w_i$ acts as a valid stationary distribution for the Markov
chain defined by $\tilde{P}$, the ideal matrix. Hence, as long as the graph $G$ is connected and at least one
node has a self loop, we are guaranteed that our graph has a unique stationary distribution
proportional to $w$. If the Markov chain is reversible then we may apply the spectral analysis of
self-adjoint operators, which is crucial in the analysis of the behavior of the method.
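
The reversibility claim can be checked numerically on a small instance. The sketch below builds the ideal (infinite-sample) chain for a hypothetical score vector and a hypothetical connected comparison graph, then tests detailed balance with $v = w$ and stationarity of $w / \sum_i w_i$; none of these specific numbers come from the paper.

```python
w = [1.0, 2.0, 4.0, 8.0]                     # hypothetical BTL scores
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]     # hypothetical connected comparison graph
n = len(w)

degree = [0] * n
for i, j in edges:
    degree[i] += 1
    degree[j] += 1
d_max = max(degree)

# Ideal transition matrix: P[i][j] = (1 / d_max) * w_j / (w_i + w_j) on edges.
P = [[0.0] * n for _ in range(n)]
for i, j in edges:
    P[i][j] = w[j] / (w[i] + w[j]) / d_max
    P[j][i] = w[i] / (w[i] + w[j]) / d_max
for i in range(n):
    P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)

# Detailed balance with v = w: w_i * P_ij equals w_j * P_ji for every pair.
balanced = all(abs(w[i] * P[i][j] - w[j] * P[j][i]) < 1e-12
               for i in range(n) for j in range(n))

# Stationarity of pi = w / sum(w): one step of the chain leaves pi unchanged.
pi = [x / sum(w) for x in w]
pi_next = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
stationary = all(abs(a - b) < 1e-12 for a, b in zip(pi, pi_next))
```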

In our setting, the matrix $P$ is a noisy version (due to finite sample error) of the ideal matrix
$\tilde{P}$ discussed above. Therefore, it naturally suggests the following algorithm as a surrogate. We
estimate the probability distribution obtained by applying matrix $P$ repeatedly starting from any
initial condition. Precisely, let $p_t(i) = \mathbb{P}(X_t = i)$ denote the distribution of the random walk at
time $t$ with $p_0 = (p_0(i)) \in \mathbb{R}^n_+$ an arbitrary starting distribution on $[n]$. Then,

$$p_{t+1}^T = p_t^T P . \qquad (2)$$



In general, the random walk converges to a stationary distribution $\pi = \lim_{t \to \infty} p_t$ which may depend
on $p_0$. When the transition matrix has a unique largest eigenvector (unique stationary distribution),
starting from any initial distribution $p_0$, the limiting distribution $\pi$ is unique. This stationary
distribution $\pi$ is the top left eigenvector of $P$, which makes computing it a simple eigenvector
computation. Formally, we state the algorithm, which assigns numerical scores to each node and which
we shall call Rank Centrality:



Rank Centrality

Input: $G = ([n], E, A)$
Output: rank $\{\pi(i)\}_{i \in [n]}$

1: Compute the transition matrix $P$ according to (1);

2: Compute the stationary distribution $\pi$ (as the limit of (2)).
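
Step 2 can be sketched with plain power iteration. This is an illustrative implementation under stated assumptions rather than the authors' code: the function name `stationary_distribution`, the uniform starting point, the tolerance, and the example chain (the ideal matrix for scores $w = (1, 2, 4)$ on a complete graph) are all choices made for this example.

```python
def stationary_distribution(P, tol=1e-12, max_iters=100000):
    """Iterate p_{t+1}^T = p_t^T P from the uniform distribution until it settles."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(max_iters):
        p_next = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(p_next, p)) < tol:
            return p_next
        p = p_next
    return p

# Illustrative chain: every pair is compared, so each node has degree 2.
w = [1.0, 2.0, 4.0]
n, d_max = len(w), 2
P = [[w[j] / (w[i] + w[j]) / d_max if i != j else 0.0 for j in range(n)]
     for i in range(n)]
for i in range(n):
    P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)

pi = stationary_distribution(P)
ranking = sorted(range(n), key=lambda i: -pi[i])   # items from highest to lowest score
```

On this noiseless chain the iteration recovers scores proportional to `w`; with a matrix built from finitely many comparisons the output would only approximate them.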



The stationary distribution of the random walk is a fixed point of the following equation:

$$\pi(i) = \sum_j \pi(j) \frac{A_{ji}}{\sum_\ell A_{i\ell}} .$$

This suggests an alternative intuitive justification: an object receives a high rank if it has been
preferred to other high ranking objects or if it has been preferred to many objects.
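
To see where a fixed-point relation of this kind comes from, one can expand the stationarity condition $\pi^T = \pi^T P$. The short derivation below assumes the self-loop construction of Section 2.2, namely $P_{ij} = A_{ij}/d_{\max}$ for $i \neq j$ and $P_{ii} = 1 - (1/d_{\max}) \sum_{k \neq i} A_{ik}$, and that node $i$ has $\sum_{k \neq i} A_{ik} > 0$.

```latex
\begin{align*}
\pi(i) &= \sum_{j} \pi(j) P_{ji}
        = \pi(i)\Big(1 - \frac{1}{d_{\max}} \sum_{k \neq i} A_{ik}\Big)
          + \frac{1}{d_{\max}} \sum_{j \neq i} \pi(j) A_{ji} \\
\Longrightarrow\quad
\pi(i) \sum_{k \neq i} A_{ik} &= \sum_{j \neq i} \pi(j) A_{ji}
\quad\Longrightarrow\quad
\pi(i) = \frac{\sum_{j \neq i} \pi(j) A_{ji}}{\sum_{k \neq i} A_{ik}} .
\end{align*}
```

The factor $1/d_{\max}$ cancels, so the fixed point does not depend on the particular rescaling used to make the rows of $P$ sum to one.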

One key question remains: does P have a well defined stationary distribution? Since the Markov 
chain has a finite state space, there is always a stationary distribution or solution of the above stated 
fixed-point equations. However, it may not be unique if the Markov chain P is not irreducible. 
The irreducibility follows easily when the graph is connected and for all edges $(i, j) \in E$, $a_{ij} > 0$,
$a_{ji} > 0$. Interestingly enough, we show that the iterative algorithm produces a meaningful solution
with near optimal sample complexity as stated in Theorem 2 when the pairs of objects that are
compared are chosen at random.

3 Main Results 

The main result of this paper derives sufficient conditions under which the proposed iterative
algorithm finds a solution that is close to the true solution (under the BTL model) for a general
model of comparison (i.e. any graph $G$). This result is stated as Theorem 1 below. In words, the
result implies that to learn the true score correctly as per our algorithm, it is sufficient to have
the number of comparisons scaling as $O(n\,\mathrm{poly}(\log n)\,\xi^{-2})$ where $\xi$ is the spectral gap of the Laplacian
of the graph $G$. This result explicitly identifies the role played by the graph structure in the ability
of the algorithm to learn the true scores.

In the special case when the pairs of objects to be compared are chosen at random, that is, the
induced $G$ is an Erdős-Rényi random graph, $\xi$ turns out to be constant and hence the resulting
number of comparisons required scales as $O(n\,\mathrm{poly}(\log n))$. This is effectively the optimal sample
complexity.

The bounds are presented as the rescaled Euclidean norm between our estimate $\pi$ and the
underlying stationary distribution $\tilde{\pi}$ of the ideal matrix. This error metric provides us with a means to quantify the
relative certainty in guessing if one item is preferred over another. Furthermore, producing such
scores is ubiquitous [DMJ10] as they may also be used to calculate the desired rankings. After
presenting our main theoretical result we will then provide simulations demonstrating the empirical
performance of our algorithm in different contexts.

3.1 Rank Centrality: Error bound for general graphs 

Recall that in the general setting, each pair of objects or items is chosen for comparison as per
the comparisons graph $G = ([n], E)$. For each such pair, we have $k$ comparisons available. The result
below characterizes the performance of Rank Centrality for such a general setting.

Before we state the result, we present some necessary notation. Let $d_i$ denote the degree of
node $i$ in $G$; let the max-degree be denoted by $d_{\max} = \max_i d_i$ and the min-degree be denoted by
$d_{\min} = \min_i d_i$; let $\kappa = d_{\max} / d_{\min}$. The Laplacian matrix of the graph $G$ is defined as $L = D^{-1} B$
where $D$ is the diagonal matrix with $D_{ii} = d_i$ and $B$ is the adjacency matrix with $B_{ij} = B_{ji} = 1$ if
$(i, j) \in E$ and $0$ otherwise. The Laplacian, defined thus, can be thought of as a transition matrix
of a reversible random walk on graph $G$: from each node $i$, jump to one of its neighbors $j$ with
equal probability. Given this, it is well known that the Laplacian of the graph has real eigenvalues
denoted as

$$-1 \le \lambda_n(L) \le \cdots \le \lambda_1(L) = 1. \qquad (3)$$

We shall denote the spectral gap of the Laplacian as

$$\xi \equiv 1 - \lambda_{\max}(L),$$

where

$$\lambda_{\max}(L) \equiv \max\{\lambda_2(L), -\lambda_n(L)\}. \qquad (4)$$

Now we state the result establishing the performance of Rank Centrality.

Theorem 1. Given $n$ objects and comparison graph $G = ([n], E)$, let each pair $(i, j) \in E$ be
compared $k$ times with outcomes produced as per a BTL model with parameters $w_1, \ldots, w_n$. Then,
there exist positive universal constants $C$ and $C'$ such that for $k \ge 4C^2 (1 + (b^5 \kappa^2 / (d_{\max} \xi^2)) \log n)$,
the following bound on the normalized error holds with probability at least $1 - n^{-C'}$:

$$\frac{\|\pi - \tilde{\pi}\|}{\|\tilde{\pi}\|} \le C \, \frac{b^{5/2} \kappa}{\xi} \sqrt{\frac{\log n}{k \, d_{\max}}},$$

where $\tilde{\pi}(i) = w_i / \sum_\ell w_\ell$ and $b \equiv \max_{i,j} w_i / w_j$. The constant $C'$ can be made as large as desired by
increasing the constant $C$.
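
To get a feel for the graph quantities $\xi$ and $\kappa$ entering the bound, consider the complete comparison graph on $n \ge 3$ items, where every pair is compared. This is a worked example added for illustration and is not taken from the paper; it uses only the definitions of $L$, $\lambda_{\max}(L)$, $\xi$ and $\kappa$ given above.

```latex
\begin{align*}
B &= \mathbf{1}\mathbf{1}^T - I, \qquad D = (n-1)\, I, \qquad
L = D^{-1} B = \tfrac{1}{n-1}\big(\mathbf{1}\mathbf{1}^T - I\big), \\
\lambda_1(L) &= 1, \qquad \lambda_2(L) = \cdots = \lambda_n(L) = -\tfrac{1}{n-1}, \\
\lambda_{\max}(L) &= \max\{\lambda_2(L), -\lambda_n(L)\} = \tfrac{1}{n-1}, \qquad
\xi = 1 - \tfrac{1}{n-1} = \tfrac{n-2}{n-1}, \qquad
\kappa = \tfrac{d_{\max}}{d_{\min}} = 1 .
\end{align*}
```

In this case $\kappa / \xi = (n-1)/(n-2) \le 2$, so the graph-dependent factor in the bound is a constant. At the other extreme, a connected bipartite graph (for example a path) has $\lambda_n(L) = -1$, which gives $\xi = 0$ under this definition.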



3.2 Rank Centrality: Error bound for random graphs 

Now we consider the special case when the comparison graph $G$ is an Erdős-Rényi random graph
with each pair $(i, j)$ being compared with probability $d/n$. When $d$ is poly-logarithmic in $n$, we
provide a strong performance guarantee. Specifically, the result stated below suggests that with
$O(n\,\mathrm{poly}(\log n))$ comparisons, Rank Centrality manages to learn the true scores with high
probability.

Theorem 2. Given $n$ objects, let the comparison graph $G = ([n], E)$ be generated by selecting each
pair $(i, j)$ to be in $E$ with probability $d/n$ independently of everything else. Each such chosen pair
of objects is compared $k$ times with the outcomes of comparisons produced as per a BTL model
with parameters $w_1, \ldots, w_n$. Then, there exist positive universal constants $C$, $C'$ and $C''$ such that
when $d \ge C' \log n$, $k \ge C'$, and $k d \ge C' b^5 \log n$, the following bound on the error rate holds with
probability at least $1 - n^{-C''}$:

$$\frac{\|\pi - \tilde{\pi}\|}{\|\tilde{\pi}\|} \le C \, b^{5/2} \sqrt{\frac{\log n}{k \, d}},$$

where $\tilde{\pi}(i) = w_i / \sum_\ell w_\ell$ and $b \equiv \max_{i,j} w_i / w_j$. The constant $C''$ can be made as large as desired by increasing
the constants $C$ and $C'$.



3.3 Remarks 

Some remarks are in order. First, Theorem 2 implies that as long as we choose $d = \Theta(\log^2 n)$ and
$k = \omega(1)$ the error goes to 0. For $k = O(\log n)$, it goes down at a rate $1/\log n$ as $n$ increases.
Since we are sampling each of the $\binom{n}{2}$ pairs with probability $d/n$ and then obtaining $k$ comparisons
per pair, we obtain $O(n \log^3 n)$ comparisons in total with $k = O(\log n)$ and $d = O(\log^2 n)$. Due to
classical results on Erdős-Rényi graphs, the induced graph $G$ is connected with high probability only
when the total number of pairs sampled scales as $\Omega(n \log n)$, so we need at least that many comparisons.
Thus, our result can be sub-optimal only up to $\log^2 n$ ($\log^\epsilon n$ if $k = \log^\epsilon n$ and $d = \log n$).
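
The comparison count in this remark follows from a short calculation, spelled out here as a worked step. It uses only the sampling model stated above, in which each pair is included independently with probability $d/n$ and then compared $k$ times.

```latex
\begin{align*}
\mathbb{E}\,|E| &= \binom{n}{2} \frac{d}{n} = \frac{(n-1)\, d}{2}, \\
\mathbb{E}\,[\text{total comparisons}] &= k \; \mathbb{E}\,|E| = \frac{k\,(n-1)\, d}{2}
  = O\big(n \cdot \log n \cdot \log^2 n\big) = O(n \log^3 n)
  \quad \text{for } k = O(\log n),\; d = O(\log^2 n).
\end{align*}
```

Dividing by the $\Omega(n \log n)$ comparisons needed for connectivity gives the stated $\log^2 n$ gap.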

Second, the $b$ parameter should be treated as constant. It is the dynamic range in which we
are trying to resolve the uncertainty between scores. If $b$ were scaling with $n$, then it would be
really easy to differentiate scores of items that are at the two opposite ends of the dynamic range,
in which case one could focus on differentiating scores of items that have their parameter values
near-by. Therefore, the interesting and challenging regime is where $b$ is constant and not scaling.

Third, for a general graph, Theorem 1 implies that by choice of $k d_{\max} = O(\kappa^2 \xi^{-2} \log n)$, the
true scores can be learnt by Rank Centrality. That is, effectively the Rank Centrality algorithm
requires $O(n \kappa^2 \xi^{-2}\,\mathrm{poly}(\log n))$ comparisons to learn scores well. Ignoring $\kappa$, the graph structure
plays a role through $\xi^{-2}$, the squared inverse of the spectral gap of the Laplacian of $G$, in dictating the
performance of Rank Centrality. A reversible natural random walk on $G$, whose transition matrix
is the Laplacian, has its mixing time scaling as $\xi^{-2}$ (precisely, relaxation time). In that sense, the
mixing time of the natural random walk on $G$ ends up playing an important role in the ability of Rank
Centrality to learn the true scores.

3.4 Experimental Results 

Under the BTL model, define an error metric of an estimated ordering $\sigma$ as the weighted sum of
pairs whose ordering is incorrect:




where $I(\cdot)$ is an indicator function. This is a more natural error metric compared to the Kemeny
distance, which is an unweighted version of the above sum, since $D_w(\cdot)$ is less sensitive to errors
between pairs with similar weights. Further, assuming without loss of generality that $w$ is
normalized such that $\sum_i w_i = 1$, the next lemma connects the error in $D_w(\cdot)$ to the bound provided in
Theorem 2. Hence, the same upper bound holds for the $D_w$ error. A proof of this lemma is provided
in Section 231 

Lemma 3.1. Let $\sigma$ be an ordering of $n$ items induced by a scoring $\pi$. Then, $D_w(\sigma) \le \|w - \pi\| / \|w\|$.

For a fixed $n = 400$ and a fixed $b = 10$, Figure 1 illustrates how the error scales with two problem
parameters: varying the number of comparisons per pair with fixed $d = 10 \log n$ (left) and varying
the sampling probability with fixed $k = 32$ (right). The ML estimator directly maximizes the
likelihood assuming the BTL model [LRF57]. If we reparameterize the problem so that $\theta_i = \log(w_i)$
then we obtain our estimates $\hat{\theta}$ by solving the convex program

$$\hat{\theta} \in \arg\min_{\theta} \sum_{(i,j) \in E} \sum_{l=1}^{k} \log\left(1 + \exp(\theta_j - \theta_i)\right) - Y_{ij}^l (\theta_j - \theta_i),$$

which is pair-wise logistic regression. This choice is optimal in the asymptotic setting; however,
for a fixed number of samples there are no theoretical guarantees for recovering the transformed scores $\theta_i$.
The method Count Wins, proposed recently by [AS11], scores an item by counting the number of
wins divided by the total number of comparisons. Ratio Matrix assigns scores according to the top
eigenvector of a matrix whose $(i, j)$-th entry is $a_{ij} / a_{ji}$ [Saa03]. As we see in Figure 1, the error
achieved by our Random Walk approach is comparable to that of the ML estimator, and vanishes at
the rate of $1/\sqrt{k}$ as predicted by our main result. Interestingly, for fixed $d$, both the Count Wins
and Ratio Matrix algorithms have strictly positive error even if we take $k \to \infty$. The figure on the
right illustrates that the error scales as $1/\sqrt{d}$ as expected from our main result.
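
For comparison, the pair-wise logistic regression behind the ML estimator can be approximated with plain gradient descent. The sketch below is an assumption-laden illustration, not the solver used for the experiments: the function name `btl_mle`, the step size, the iteration count, and the input layout (win counts per pair) are all hypothetical.

```python
import math

def btl_mle(n, wins, step=1.0, num_iters=2000):
    """Gradient descent on the pairwise logistic loss in theta = log(w).

    wins maps a pair (i, j) to (times j beat i, times i beat j).
    Returns theta shifted to mean zero, since only differences matter.
    """
    total = sum(a + b for a, b in wins.values())
    theta = [0.0] * n
    for _ in range(num_iters):
        grad = [0.0] * n
        for (i, j), (j_wins, i_wins) in wins.items():
            p_j = 1.0 / (1.0 + math.exp(theta[i] - theta[j]))  # model P(j beats i)
            g = (j_wins + i_wins) * p_j - j_wins               # d(loss) / d(theta_j)
            grad[j] += g
            grad[i] -= g
        theta = [t - step * g / total for t, g in zip(theta, grad)]
        mean = sum(theta) / n
        theta = [t - mean for t in theta]
    return theta

# Two items; item 1 beat item 0 in 3 of 4 comparisons.
theta = btl_mle(2, {(0, 1): (3, 1)})
ratio = math.exp(theta[1] - theta[0])   # estimated w_1 / w_0
```

In this two-item case the estimated ratio approaches the empirical odds of 3 to 1. If some item wins (or loses) all of its comparisons, the minimizer does not exist and the iterates drift without converging, so a practical implementation would need regularization or a stopping rule.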




Figure 1: Average error $D_w(\sigma)$ of orderings from four rank aggregation algorithms, averaged over 20
instances. In the figure on the left we assume that $d$ and $n$ are fixed while we increase $k$. The figure on the
right takes $k = 32$ fixed and lets $d$ increase.

To test our algorithm on real data, we used a public dataset collected from an online poll run by the
Washington Post^1 from December 2010 to January 2011. Using the allourideas^2 platform developed by
[SL12], they asked who had the worst year in Washington, where each user was asked to compare a
series of randomly selected pairs of political entities. There are 67 political entities in the dataset,
and the resulting graph is a complete graph on these 67 nodes. We used Rank Centrality and
Count Wins to aggregate this data. Since we do not have the ground truth for real datasets, we
used the ranking each algorithm produces on the full complete graph as its ground truth. This gives two
ground truth rankings, one for each algorithm. Each ground truth is compared to the ranking the same
algorithm produces from only a subset of the data, which is generated by sampling each edge with a
given sampling rate and revealing only the data on those sampled edges. We want to measure how much
each algorithm is affected by eliminating edges from the complete graph. Let $\sigma_{GT}$ be the ranking
we get by applying our choice of rank aggregation algorithm to the complete dataset, and $\sigma_{Sample}$
be the ranking we get from the sampled dataset. To measure the resulting error in the ranking, we use
the following metric:

$$D_{L_1}(\sigma_{GT}, \sigma_{Sample}) = \frac{1}{n} \sum_i |\sigma_{GT}(i) - \sigma_{Sample}(i)| .$$

Figure 2 illustrates that, compared to Count Wins, Rank Centrality is less sensitive to sampling
the dataset, and hence more robust when the available comparison data is limited.
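
The evaluation protocol above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names (`ranking_from_scores`, `l1_ranking_distance`, `sample_edges`) and the toy scores are assumptions made for the example and are not taken from the dataset or from the authors' code.

```python
import random


def ranking_from_scores(scores):
    """Map each item to its rank position (0 = highest score)."""
    ordered = sorted(scores, key=lambda item: (-scores[item], item))
    return {item: pos for pos, item in enumerate(ordered)}


def l1_ranking_distance(rank_gt, rank_sample):
    """D_L1 = (1/n) * sum_i |sigma_GT(i) - sigma_Sample(i)|."""
    n = len(rank_gt)
    return sum(abs(rank_gt[i] - rank_sample[i]) for i in rank_gt) / n


def sample_edges(edges, rate, rng):
    """Keep each compared pair independently with probability `rate`."""
    return [edge for edge in edges if rng.random() < rate]


# Toy scores (hypothetical): ranking on the full data vs. on a subsample.
sigma_gt = ranking_from_scores({"a": 0.5, "b": 0.3, "c": 0.2})
sigma_sample = ranking_from_scores({"a": 0.3, "b": 0.5, "c": 0.2})
distance = l1_ranking_distance(sigma_gt, sigma_sample)

# Subsample the edges of a complete graph on 4 nodes at sampling rate 0.5.
all_edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
kept = sample_edges(all_edges, 0.5, random.Random(0))
```

In an experiment of this kind, one would rerun the chosen aggregation algorithm on the comparisons restricted to `kept` and report `l1_ranking_distance` against that algorithm's own full-data ranking.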

^1 http://www.washingtonpost.com/wp-srv/interactivity/worst-year-voting.html
^2 http://www.allourideas.org
Figure 2: Experimental results on a real dataset show that Rank Centrality is less sensitive to having limited
data.



3.5 Information-theoretic lower bound 

In previous sections, we presented the achievable error rate based on a particular low-complexity
algorithm. In this section, we ask how this bound compares to the fundamental limit under the BTL
model.

Our result in Theorem 2 provides an upper bound on the achievable error rate between estimated
scores and the true underlying scores. We provide a constructive argument to lower bound the
minimax error rate over a class of BTL models. Concretely, we consider the scores coming from a
simplex with bounded dynamic range defined as

$$S_b \equiv \Big\{ \tilde\pi \in \mathbb{R}^n \;\Big|\; \sum_{i \in [n]} \tilde\pi_i = 1 , \; \max_{i,j} \frac{\tilde\pi_i}{\tilde\pi_j} \le b \Big\} .$$

We constrain the scores to be on the simplex, because we represent the scores by their projection onto
the standard simplex as explained in Section 2.1. Then, we can prove the following lower bound
on the minimax error rate.

Theorem 3. Consider a minimax scenario where we first choose an estimator $\hat\pi$ that estimates
the BTL weights from given observations and, for this particular estimator $\hat\pi$, nature chooses the
worst-case true BTL weights $\tilde\pi$. Then, for any estimator $\hat\pi$ that we choose, there
exists a true score vector $\tilde\pi$ with dynamic range at most b such that no algorithm can achieve an
expected normalized error smaller than the following minimax lower bound:

$$\inf_{\hat\pi} \sup_{\tilde\pi \in S_b} \frac{E[\|\hat\pi - \tilde\pi\|]}{\|\tilde\pi\|} \ge \frac{b-1}{240\sqrt{10}\,(b+1)} \frac{1}{\sqrt{kd}} , \tag{5}$$

where the infimum ranges over all estimators $\hat\pi$ that are measurable functions of the observations,
we observe the outcomes of k comparisons for each pair of items, and we compare each pair of items
with probability d/n.
By definition the dynamic range is always at least one. When b = 1, we can trivially achieve a
minimax rate of zero. Since the infimum ranges over all measurable functions, it includes a trivial
estimator which always outputs $(1/n)\mathbb{1}$ regardless of the observations, and this estimator achieves
zero error when b = 1. In the regime where the dynamic range b is bounded away from one and
bounded above by a constant, Theorem 3 establishes that the upper bound obtained in Theorem 2
is minimax-optimal up to factors logarithmic in the number of items n.
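
To get a feel for the scaling in (5), the right-hand side can be evaluated directly. The function name `minimax_lower_bound` below is a hypothetical helper introduced only for illustration; it simply evaluates the closed-form expression and makes no claim about constants beyond what the theorem states.

```python
import math


def minimax_lower_bound(b, k, d):
    """Evaluate (b - 1) / (240 * sqrt(10) * (b + 1)) * 1 / sqrt(k * d)."""
    return (b - 1.0) / (240.0 * math.sqrt(10.0) * (b + 1.0)) / math.sqrt(k * d)


# The bound vanishes at b = 1 and decays like 1/sqrt(k d).
at_b_one = minimax_lower_bound(1.0, 32, 10)
base = minimax_lower_bound(3.0, 32, 10)
four_times_k = minimax_lower_bound(3.0, 128, 10)
```

Quadrupling k (or d) halves the bound, which mirrors the $1/\sqrt{kd}$ dependence of the upper bound up to the logarithmic factor.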

4 Proofs 

We may now present proofs of Theorems 1 and 2. We first present a proof of convergence for general
graphs in Theorem 1. This result follows from a lemma that we state below, which shows that our
algorithm enjoys convergence properties that result in useful upper bounds. The lemma is made
general and uses standard techniques of spectral theory. The main difficulty arises in establishing
that the Markov chain P satisfies certain properties that we will discuss below. In order to show
that these properties hold we must rely on the specific model that allows us to ultimately establish
error bounds that hold with high probability. Given the proof for the general graph, Theorem 2
follows by showing that in the case of Erdős-Rényi graphs, the necessary conditions are satisfied
with high probability. Then, we provide a proof of the information-theoretic lower bound.

4.1 Algorithm convergence for general graphs 

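In this section, we characterize the error rate achieved by our ranking algorithm. Given the random
Markov chain P, where the randomness comes from the outcome of the comparisons, we will show
that it does not deviate too much from its expectation $\tilde P$, which we recall is defined as

$$\tilde P_{ij} = \begin{cases} \dfrac{1}{d_{\max}} \dfrac{w_j}{w_i + w_j} & \text{if } i \ne j , \\[2mm] 1 - \dfrac{1}{d_{\max}} \sum_{\ell \ne i :\, (i,\ell) \in E} \dfrac{w_\ell}{w_i + w_\ell} & \text{if } i = j , \end{cases}$$

for all $(i,j) \in E$ and $\tilde P_{ij} = 0$ otherwise.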

Recall from the discussion following equation (1) that the transition matrix P used in our
ranking algorithm has been carefully chosen such that the corresponding expected transition matrix
$\tilde P$ has two important properties. First, the stationary distribution of $\tilde P$, which we denote by $\tilde\pi$, is
proportional to the weight vector w. Furthermore, when the graph is connected and has self loops
(at least one of which exists), this Markov chain is irreducible and aperiodic so that the stationary
distribution is unique. The next important property of $\tilde P$ is that it is reversible: $\tilde\pi(i)\tilde P_{ij} = \tilde\pi(j)\tilde P_{ji}$.
This observation implies that the operator $\tilde P$ is symmetric in an appropriately defined inner product
space. The symmetry of the operator $\tilde P$ will be crucial in applying ideas from spectral analysis to
prove our main results.
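
The two properties above can be checked numerically on a small example. The following is a minimal sketch, not the authors' implementation: the function names `transition_matrix` and `stationary` are hypothetical, plain power iteration is used to approximate the stationary distribution, and `frac_wins[(i, j)]` is assumed to hold $a_{ij}$, the fraction of comparisons in which j is preferred to i, so that its expectation under the BTL model is $w_j/(w_i + w_j)$.

```python
def transition_matrix(n, edges, frac_wins, d_max):
    """Build P with P[i][j] = a_ij / d_max on compared pairs (i, j).

    frac_wins[(i, j)] is a_ij, assumed to be the fraction of comparisons
    in which j beat i, so a_ji = 1 - a_ij. The diagonal entry of each row
    absorbs the remaining probability mass (a self loop).
    """
    P = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a_ij = frac_wins[(i, j)]
        P[i][j] = a_ij / d_max
        P[j][i] = (1.0 - a_ij) / d_max
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)
    return P


def stationary(P, iters=2000):
    """Approximate the stationary distribution by iterating p^T <- p^T P."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p


# Expected chain under BTL weights w on a complete graph with 3 items.
w = [1.0, 2.0, 3.0]
edges = [(0, 1), (1, 2), (0, 2)]
expected_frac = {(i, j): w[j] / (w[i] + w[j]) for i, j in edges}
P_tilde = transition_matrix(3, edges, expected_frac, d_max=2)
pi_tilde = stationary(P_tilde)
```

With noisy comparison outcomes in place of `expected_frac`, the same two functions give a sketch of the estimator itself; here they are used only to confirm that the expected chain has stationary distribution proportional to w and satisfies detailed balance.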

Let $\Delta$ denote the fluctuation of the transition matrix around its mean, such that $\Delta = P - \tilde P$. The
following lemma bounds the deviation of the Markov chain after t steps in terms of two important
quantities: the spectral radius of the fluctuation $\|\Delta\|_2$ and the spectral gap $1 - \lambda_{\max}(\tilde P)$, where

$$\lambda_{\max}(\tilde P) \equiv \max\{\lambda_2(\tilde P), -\lambda_n(\tilde P)\} .$$

Since the $\lambda_i(\tilde P)$'s are sorted, $\lambda_{\max}(\tilde P)$ is the second largest eigenvalue in absolute value.
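Lemma 4.1. For any Markov chain $P = \tilde P + \Delta$ with a reversible Markov chain $\tilde P$, let $p_t$ be the
distribution of the Markov chain P when started with initial distribution $p_0$. Then,

$$\frac{\|p_t - \tilde\pi\|}{\|\tilde\pi\|} \le \rho^t \, \frac{\|p_0 - \tilde\pi\|}{\|\tilde\pi\|} \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}} + \frac{1}{1-\rho} \|\Delta\|_2 \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}} , \tag{6}$$

where $\tilde\pi$ is the stationary distribution of $\tilde P$, $\tilde\pi_{\min} = \min_i \tilde\pi(i)$, $\tilde\pi_{\max} = \max_i \tilde\pi(i)$, and
$\rho = \lambda_{\max}(\tilde P) + \|\Delta\|_2 \sqrt{\tilde\pi_{\max}/\tilde\pi_{\min}}$.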

The above result provides a general mechanism for establishing error bounds between an
estimated stationary distribution $\pi$ and the desired stationary distribution $\tilde\pi$. It is worth noting that
the result only requires control on the quantities $\|\Delta\|_2$ and $1 - \rho$. We may now state two technical
lemmas that provide control on the quantities $\|\Delta\|_2$ and $1 - \rho$, respectively.

Lemma 4.2. For $k \ge 13$ and $k d_{\max} \ge C' \log n$ with appropriately large constant $C'$, the error matrix
$\Delta = P - \tilde P$ satisfies

$$\|\Delta\|_2 \le C \sqrt{\frac{\log n}{k d_{\max}}}$$

with probability at least $1 - n^{-C''}$; the constant $C''$ can be made large at the cost of possibly making C
and $C'$ larger.

The next lemma provides our desired bound on $1 - \rho$.

Lemma 4.3. When $\|\Delta\|_2 \le C\sqrt{\log n/(k d_{\max})}$ and $k \ge 4C^2 b^5 d_{\max} \log n \,(1/(d_{\min}\xi))^2$, the spectral
radius satisfies

$$1 - \rho \ge \frac{\xi \, d_{\min}}{2 b^2 d_{\max}} .$$

With the above results in hand we may now proceed with the proof of Theorem 1.
When there is a positive spectral gap such that $\rho < 1$, the first term in (6) vanishes as t grows.
The rest of the first term is bounded and independent of t. Formally, we have

$$\tilde\pi_{\max}/\tilde\pi_{\min} \le b , \quad \|\tilde\pi\| \ge 1/\sqrt{n} , \quad \text{and} \quad \|p_0 - \tilde\pi\| \le 2 ,$$

by the assumption that $\max_{i,j} w_i/w_j \le b$ and the fact that $\tilde\pi(i) = w_i/(\sum_j w_j)$. Hence, the error
between the distribution at the t-th iteration $p_t$ and the true stationary distribution $\tilde\pi$ is dominated
by the second term in equation (6). Substituting the bounds in Lemma 4.2 and Lemma 4.3, the
dominant second term in equation (6) is bounded by

$$\lim_{t \to \infty} \frac{\|p_t - \tilde\pi\|}{\|\tilde\pi\|} \le \frac{C b^{5/2}}{\xi \, d_{\min}} \sqrt{\frac{d_{\max} \log n}{k}} .$$

In fact, we only need $t = O(\log n + \log b + \log(d_{\max} \log n/(d_{\min}^2 k \xi^2)))$ to ensure that the above
bound holds up to a constant factor. This finishes the proof of Theorem 1. Notice that in order
for this result to hold, we need the following two conditions: $k d_{\max} \ge C' \log n$ for Lemma 4.2 and
$k \ge 4C^2 b^5 d_{\max} \log n \,(1/(d_{\min}\xi))^2$ for Lemma 4.3. Since $b \ge 1$, $d_{\max} \ge d_{\min}$, and $\xi \le 1$, the second
condition always implies the first for any choice of $4C^2 \ge C'$.
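4.1.1 Proof of Lemma 4.1

Due to the reversibility of $\tilde P$, we can view it as a self-adjoint operator on an appropriately defined
inner product space. This observation allows us to apply the well-understood spectral analysis of
self-adjoint operators. In order to establish this fact, define an inner product space $L^2(\tilde\pi)$ as a space
of n-dimensional vectors with

$$\langle a, b \rangle_{\tilde\pi} = \sum_{i=1}^{n} a_i \tilde\pi_i b_i .$$

Similarly, we define $\|a\|_{\tilde\pi} = \sqrt{\langle a, a \rangle_{\tilde\pi}}$ as the 2-norm in $L^2(\tilde\pi)$. For a self-adjoint operator A in
$L^2(\tilde\pi)$, we define $\|A\|_{\tilde\pi,2} = \max_a \|Aa\|_{\tilde\pi}/\|a\|_{\tilde\pi}$ as the operator norm. These norms are related to the
corresponding norms in the Euclidean space through the following inequalities:

$$\sqrt{\tilde\pi_{\min}} \, \|a\| \le \|a\|_{\tilde\pi} \le \sqrt{\tilde\pi_{\max}} \, \|a\| , \tag{7}$$

$$\sqrt{\frac{\tilde\pi_{\min}}{\tilde\pi_{\max}}} \, \|A\|_2 \le \|A\|_{\tilde\pi,2} \le \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}} \, \|A\|_2 . \tag{8}$$

A reversible Markov chain $\tilde P$ is self-adjoint in $L^2(\tilde\pi)$. To see this, define a closely related
symmetric matrix $S = \tilde\Pi^{1/2} \tilde P \tilde\Pi^{-1/2}$, where $\tilde\Pi$ is a diagonal matrix with $\tilde\Pi_{ii} = \tilde\pi(i)$. The assumption
that $\tilde P$ is reversible, i.e. $\tilde\pi(i)\tilde P_{ij} = \tilde\pi(j)\tilde P_{ji}$, implies that S is symmetric, and it follows that $\tilde P$ is
self-adjoint in $L^2(\tilde\pi)$.

Further, the asymmetric matrix $\tilde P$ and the symmetric matrix S have the same set of eigenvalues.
By the Perron-Frobenius theorem, the eigenvalues are at most one. Let $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge -1$
be the eigenvalues, and let $u_i$ be the left eigenvector of S corresponding to $\lambda_i$. Then the i-th left
eigenvector of $\tilde P$ is $v_i = \tilde\Pi^{1/2} u_i$. Since the first left eigenvector of $\tilde P$ is the stationary distribution,
i.e. $v_1 = \tilde\pi$, we have $u_1(i) = \tilde\pi(i)^{1/2}$.

For the Markov chain $P = \tilde P + \Delta$, where $\tilde P$ is a reversible Markov chain such that $\tilde\pi^T \tilde P = \tilde\pi^T$,
we let $p_t^T = p_{t-1}^T P$. Then,

$$p_t^T - \tilde\pi^T = (p_{t-1} - \tilde\pi)^T (\tilde P + \Delta) + \tilde\pi^T \Delta . \tag{9}$$

Define $S_0 = \lambda_1 u_1 u_1^T$ to be the rank-1 projection of S, and a corresponding matrix $\tilde P_0 = \tilde\Pi^{-1/2} S_0 \tilde\Pi^{1/2}$.
Using the fact that $(p_\ell - \tilde\pi)^T \tilde\Pi^{-1/2} u_1 = (p_\ell - \tilde\pi)^T \mathbb{1} = 0$ for any probability distribution $p_\ell$, we get
$(p_\ell - \tilde\pi)^T \tilde P_0 = (p_\ell - \tilde\pi)^T \tilde\Pi^{-1/2} u_1 \lambda_1 u_1^T \tilde\Pi^{1/2} = 0$. Then, from (9) we get

$$p_t^T - \tilde\pi^T = (p_{t-1} - \tilde\pi)^T (\tilde P - \tilde P_0 + \Delta) + \tilde\pi^T \Delta .$$

By definition of $\tilde P_0$, it follows that $\|\tilde P - \tilde P_0\|_{\tilde\pi,2} = \|S - S_0\|_2 = \lambda_{\max}$. Let $\rho = \lambda_{\max} + \|\Delta\|_{\tilde\pi,2}$;
then

$$\begin{aligned}
\|p_t - \tilde\pi\|_{\tilde\pi} &\le \|p_{t-1} - \tilde\pi\|_{\tilde\pi} \big( \|\tilde P - \tilde P_0\|_{\tilde\pi,2} + \|\Delta\|_{\tilde\pi,2} \big) + \|\tilde\pi^T \Delta\|_{\tilde\pi} \\
&\le \rho^t \|p_0 - \tilde\pi\|_{\tilde\pi} + \sum_{\ell=0}^{t-1} \rho^{t-1-\ell} \|\tilde\pi^T \Delta\|_{\tilde\pi} .
\end{aligned}$$

Dividing each side by $\|\tilde\pi\|$ and applying the bounds in (7) and (8), we get

$$\frac{\|p_t - \tilde\pi\|}{\|\tilde\pi\|} \le \rho^t \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}} \, \frac{\|p_0 - \tilde\pi\|}{\|\tilde\pi\|} + \sum_{\ell=0}^{t-1} \rho^{t-1-\ell} \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}} \, \frac{\|\tilde\pi^T \Delta\|}{\|\tilde\pi\|} .$$

This finishes the proof of the desired claim.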
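
For completeness, the step from the last display to the form stated in (6) can be sketched as follows, assuming $\rho < 1$ and reading $\|\Delta\|_2$ as the Euclidean operator norm:

```latex
\sum_{\ell=0}^{t-1} \rho^{\,t-1-\ell} \;=\; \frac{1-\rho^{t}}{1-\rho} \;\le\; \frac{1}{1-\rho},
\qquad
\frac{\|\tilde\pi^T \Delta\|}{\|\tilde\pi\|} \;\le\; \|\Delta\|_2 ,
\qquad
\|\Delta\|_{\tilde\pi,2} \;\le\; \sqrt{\frac{\tilde\pi_{\max}}{\tilde\pi_{\min}}}\,\|\Delta\|_2 .
```

The first two facts turn the sum into the second term of (6). The third, which is the upper bound in (8), shows that the $\rho$ used in this proof is at most the $\rho$ in the statement of the lemma; since the right-hand side is increasing in $\rho$ on $[0, 1)$, the bound continues to hold with the larger value.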
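4.1.2 Proof of Lemma 4.2

Our interest is in bounding $\|\Delta\|_2$. Now $\Delta = P - \tilde P$ so that for $1 \le i, j \le n$ with $i \ne j$,

$$\Delta_{ij} = \frac{1}{k d_{\max}} C_{ij} , \tag{10}$$

where $C_{ij}$ is distributed as per $B(k, p_{ij}) - k p_{ij}$ if $(i,j) \in E$ and $C_{ij} = 0$ otherwise. Here $B(k, p_{ij})$ is a
Binomial random variable with parameters k and $p_{ij} = \frac{w_j}{w_i + w_j}$. It should be noted that $C_{ij} + C_{ji} = 0$
and the $C_{ij}$ are independent across all the pairs with $i < j$. For $1 \le i \le n$,

$$\Delta_{ii} = P_{ii} - \tilde P_{ii} = \Big(1 - \sum_{j \ne i} P_{ij}\Big) - \Big(1 - \sum_{j \ne i} \tilde P_{ij}\Big) = \sum_{j \ne i} \tilde P_{ij} - \sum_{j \ne i} P_{ij} = - \sum_{j \ne i} \Delta_{ij} . \tag{11}$$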

Given the above dependence between diagonal and off-diagonal entries, we shall bound $\|\Delta\|_2$ as
follows: let D be the diagonal matrix with $D_{ii} = \Delta_{ii}$ for $1 \le i \le n$ and $\bar\Delta = \Delta - D$. Then,

$$\|\Delta\|_2 = \|D + \bar\Delta\|_2 \le \|D\|_2 + \|\bar\Delta\|_2 . \tag{12}$$

We shall establish the bound of $O\big(\sqrt{\log n/(k d_{\max})}\big)$ for both $\|D\|_2$ and $\|\bar\Delta\|_2$ to establish Lemma 4.2.



Bounding $\|D\|_2$. Since D is a diagonal matrix, $\|D\|_2 = \max_i |D_{ii}| = \max_i |\Delta_{ii}|$. For a given
fixed i, as per (10)-(11), $k d_{\max} \Delta_{ii}$ can be expressed as a summation of at most $k d_{\max}$ independent,
zero-mean random variables taking values in a range of at most 1. Therefore, by an application
of the Azuma-Hoeffding inequality, it follows that

$$P(k d_{\max} |\Delta_{ii}| > t) \le 2 \exp\Big( - \frac{t^2}{2 k d_{\max}} \Big) . \tag{13}$$

By selection of $t = C\sqrt{k d_{\max} \log n}$ for an appropriately large constant C, it follows from the above display
that

$$P\Big( |\Delta_{ii}| > C \sqrt{\frac{\log n}{k d_{\max}}} \Big) \le O(n^{-C^2/2}) . \tag{14}$$

In summary, we have

$$\|D\|_2 \le C \sqrt{\frac{\log n}{k d_{\max}}} . \tag{15}$$



Bounding $\|\bar\Delta\|_2$ when $d_{\max} \le \log n$. Towards this goal, we shall make use of the following standard
inequality: for any square matrix M,

$$\|M\|_2 \le \sqrt{\|M\|_1 \|M\|_\infty} , \tag{16}$$

where $\|M\|_1 = \max_j \sum_i |M_{ij}|$ and $\|M\|_\infty = \|M^T\|_1$. In words, $\|M\|_2^2$ is bounded above by the product
of the maximal row-sum and column-sum of absolute values of M. Since $\bar\Delta_{ij}$ and $\bar\Delta_{ji}$ are identically
distributed and entries along each row (and hence each column) are independent, it is sufficient to
obtain a high probability bound ($\ge 1 - 1/\mathrm{poly}(n)$) for the maximal row-sum of absolute values of $\bar\Delta$;
exactly the same bound will apply for the column-sum using a union bound.
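
Inequality (16) is easy to sanity-check numerically. The sketch below uses only the standard library; the helper names and the small skew-symmetric test matrix (chosen to mimic the sign structure $\bar\Delta_{ij} = -\bar\Delta_{ji}$) are illustrative assumptions. The spectral norm is approximated by power iteration on $M^T M$, which can only underestimate it, so the check is a consistency check and not a proof.

```python
import math


def norm_1(M):
    """Maximal column sum of absolute values."""
    return max(sum(abs(M[i][j]) for i in range(len(M))) for j in range(len(M[0])))


def norm_inf(M):
    """Maximal row sum of absolute values."""
    return max(sum(abs(x) for x in row) for row in M)


def spectral_norm(M, iters=500):
    """Estimate ||M||_2 by power iteration on M^T M."""
    rows, cols = len(M), len(M[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        Mv = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(M[i][j] * Mv[i] for i in range(rows)) for j in range(cols)]
        length = math.sqrt(sum(x * x for x in w))
        if length == 0.0:
            return 0.0
        v = [x / length for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in Mv))


# Small skew-symmetric example with zero diagonal (hypothetical values).
M = [[0.0, 0.2, -0.1],
     [-0.2, 0.0, 0.3],
     [0.1, -0.3, 0.0]]
lhs = spectral_norm(M)
rhs = math.sqrt(norm_1(M) * norm_inf(M))
```

For this matrix the nonzero singular values both equal $\sqrt{0.2^2 + 0.1^2 + 0.3^2}$, which is below the row-sum and column-sum bound of 0.5.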

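To that end, consider the sum of the absolute values of the i-th row of $\bar\Delta$ and for simplicity
let us denote it by $R_i$. Then,

$$R_i = \frac{1}{k d_{\max}} \sum_{j \ne i} |C_{ij}| , \tag{17}$$

where recall that $C_{ij} = X_{ij} - k p_{ij}$ with $X_{ij}$ an independent Binomial random variable with
parameters k, $p_{ij}$. Therefore, for any $s > 0$,

$$\begin{aligned}
P(R_i > s) &= P\Big( \sum_j |C_{ij}| > k d_{\max} s \Big) \\
&\le E\Big[ \exp\Big( \theta \sum_j |C_{ij}| \Big) \Big] \exp(-\theta k d_{\max} s) , \quad \text{for any } \theta > 0 , \\
&= \exp(-\theta k d_{\max} s) \prod_j E[\exp(\theta |C_{ij}|)] .
\end{aligned} \tag{18}$$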

Next, we bound $E[\exp(\theta|C_{ij}|)]$. To that end, observe that for any $x \in \mathbb{R}$ and $\theta > 0$,

$$\exp(\theta|x|) \le \exp(\theta x) + \exp(-\theta x) .$$

From this, it follows that

$$E[\exp(\theta|C_{ij}|)] \le E[\exp(\theta C_{ij})] + E[\exp(-\theta C_{ij})] . \tag{19}$$

Now for any $\phi \in \mathbb{R}$, using the fact that $X_{ij}$ is Binomial and $1 + x \le \exp(x)$ for any
$x \in \mathbb{R}$, we have

$$\begin{aligned}
E[\exp(\phi C_{ij})] &= \exp(-\phi k p_{ij}) \big(1 + p_{ij}(\exp(\phi) - 1)\big)^k \\
&\le \exp(-\phi k p_{ij}) \exp\big(k p_{ij}(\exp(\phi) - 1)\big) .
\end{aligned} \tag{20}$$

Using a second-order Taylor expansion, for any $\phi \in [-\ln(4/3), \ln(4/3)]$, we obtain that

$$|\exp(\phi) - 1 - \phi| \le \frac{2}{3}\phi^2 . \tag{21}$$

Using the above display in (20), we have

$$E[\exp(\phi C_{ij})] \le \exp(2 k p_{ij} \phi^2/3) . \tag{22}$$

From (19) and (22), we have that for $0 \le \theta \le \ln(4/3)$,

$$E[\exp(\theta|C_{ij}|)] \le 2\exp(2k\theta^2/3) . \tag{23}$$

Replacing (23) in (18) and recalling the fact that the degree of node i satisfies $d_i \le d_{\max}$, we have that for
$0 \le \theta \le \ln(4/3)$,

$$P(R_i > s) \le \exp\big( -\theta k d_{\max} s + 2 k d_{\max} \theta^2/3 + d_{\max} \ln 2 \big) . \tag{24}$$



Using (24), the optimal choice of $\theta$ is $\theta = (3/4)s$. Choosing $s = \sqrt{\frac{8}{3 k d_{\max}}(c \ln n + d_{\max} \ln 2)}$, for a
given $c > 1$, we obtain

$$P\Big( R_i > \sqrt{\frac{8}{3 k d_{\max}}(c \ln n + d_{\max} \ln 2)} \Big) \le n^{-c} . \tag{25}$$

To ensure that $\theta = (3/4)s \le \ln(4/3)$, we need

$$\frac{1}{k d_{\max}} (c \ln n + d_{\max} \ln 2) \le \frac{2}{3} \Big(\ln \frac{4}{3}\Big)^2 , \tag{26}$$

which holds when $k d_{\max} \ge C' \log n$ and $k \ge 13$ for some positive constant $C'$. From the above, and
an application of the union bound across rows and columns, it follows that with probability at least
$1 - O(n^{-c+1})$, as long as $k d_{\max} = \Omega(\log n)$ with an appropriately large constant, we have that

$$\|\bar\Delta\|_2 \le c' \sqrt{\frac{\log n + d_{\max}}{k d_{\max}}} , \tag{27}$$

for an appropriate choice of constant $c'$. Note that the above inequality reduces to the desired claim
of Lemma 4.2 for any $d_{\max} = O(\log n)$.



Bounding $\|\bar\Delta\|_2$ when $d_{\max} \ge \log n$. Towards this goal, we shall make use of
recent results on the concentration of sums of independent random matrices. For completeness,
we recall the following result [Tro11].

Lemma 4.4 (Theorem 6.2 [Tro11]). Consider a finite sequence $\{Z^{ij}\}_{i<j}$ of independent random
self-adjoint matrices with dimensions $n \times n$. Assume that

$$E[Z^{ij}] = 0 \quad \text{and} \quad E[(Z^{ij})^p] \preceq \frac{p!}{2} R^{p-2} (A^{ij})^2 , \quad \text{for } p = 2, 3, 4, \ldots$$

Define $\sigma^2 \equiv \big\| \sum_{i<j} (A^{ij})^2 \big\|_2$. Then, for all $t \ge 0$,

$$P\Big( \Big\| \sum_{i<j} Z^{ij} \Big\|_2 \ge t \Big) \le 2n \exp\Big( \frac{-t^2/2}{\sigma^2 + R t} \Big) .$$

We wish to prove concentration results on $\bar\Delta = \Delta - D = \sum_{i<j} Z^{ij}$ where

$$Z^{ij} = (e_i e_j^T - e_j e_i^T)(P_{ij} - \tilde P_{ij}) \quad \text{for } (i,j) \in E ,$$

and $Z^{ij} = 0$ if i and j are not connected. The $Z^{ij}$'s as defined are zero-mean and independent;
however, they are not self-adjoint. Nevertheless, we can symmetrize them by applying the dilation
ideas presented in [Tro11]:

$$\tilde Z^{ij} = \begin{bmatrix} 0 & Z^{ij} \\ (Z^{ij})^T & 0 \end{bmatrix} .$$

Now we can apply the above lemma to these self-adjoint, independent and zero-mean random
matrices.
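To find R and the $A^{ij}$'s that satisfy the conditions of the lemma, first consider a set of matrices
$\{\tilde A^{ij}\}_{i<j}$ such that $\tilde Z^{ij} = \Delta_{ij} \tilde A^{ij}$ and

$$\tilde A^{ij} = \begin{bmatrix} 0 & e_i e_j^T - e_j e_i^T \\ e_j e_i^T - e_i e_j^T & 0 \end{bmatrix}$$

if $(i,j) \in E$ and zero otherwise. In the following, we show that the condition on the p-th moment is
satisfied with $R = 1/\sqrt{k d_{\max}^2}$ and $(A^{ij})^2 = (1/(k d_{\max}^2)) (\tilde A^{ij})^2$ such that

$$E[(\tilde Z^{ij})^p] \preceq \frac{p!}{2} \Big( \frac{1}{\sqrt{k d_{\max}^2}} \Big)^{p-2} \frac{1}{k d_{\max}^2} (\tilde A^{ij})^2 . \tag{28}$$

We can also show that $\sigma^2 = \big\| \sum_{i<j} (A^{ij})^2 \big\|_2 \le 1/(k d_{\max})$, since

$$\sum_{i<j} (A^{ij})^2 = \frac{1}{k d_{\max}^2} \sum_{i<j} \mathbb{I}_{((i,j) \in E)} \begin{bmatrix} e_i e_i^T + e_j e_j^T & 0 \\ 0 & e_i e_i^T + e_j e_j^T \end{bmatrix} \preceq \frac{1}{k d_{\max}} \, I_{2n \times 2n} ,$$

where $\mathbb{I}_{(\cdot)}$ is the indicator function. Therefore, we can apply the results of Lemma 4.4 to obtain a
bound on $\big\| \sum_{i<j} Z^{ij} \big\|_2 = \big\| \sum_{i<j} \tilde Z^{ij} \big\|_2$:

$$P\Big( \Big\| \sum_{i<j} \tilde Z^{ij} \Big\|_2 \ge t \Big) \le 2n \exp\Big( \frac{-t^2/2}{(1/(k d_{\max})) + (t/\sqrt{k d_{\max}^2})} \Big) .$$

Under our assumption that $d_{\max} \ge \log n$ and choosing $t = C\sqrt{\log n/(k d_{\max})}$, the tail probability is
bounded by $2n \exp\{-(C^2 \log n/2)(1/(1 + C))\}$. Hence, we get the desired bound that $\|\Delta - D\|_2 \le
C\sqrt{\log n/(k d_{\max})}$ with probability at least $1 - 2n^{-(C^2 - 2C - 2)/(2C + 2)}$.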



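Now we are left to prove that the condition (28) holds. A quick calculation shows that

$$(\tilde A^{ij})^p = \begin{cases} (\tilde A^{ij})^2 & \text{for } p \text{ even} , \\ \tilde A^{ij} & \text{for } p \text{ odd} . \end{cases} \tag{29}$$

Furthermore, we can verify that the eigenvalues of $\tilde A^{ij}$ are either 1 or −1. Hence, $(\tilde A^{ij})^p \preceq (\tilde A^{ij})^2$
for all $p \ge 1$. Thus, given the fact that $\tilde Z^{ij} = \Delta_{ij} \tilde A^{ij}$, we have that $E[(\tilde Z^{ij})^p] = E[\Delta_{ij}^p (\tilde A^{ij})^p] \preceq
|E[\Delta_{ij}^p]| (\tilde A^{ij})^2$ for all p. This fact follows since for any constant $c \in \mathbb{R}$, $c \tilde A^{ij} \preceq |c| (\tilde A^{ij})^2$ and
$c (\tilde A^{ij})^2 \preceq |c| (\tilde A^{ij})^2$. Hence, coupling these observations with the identities presented in equation (29),
we have

$$E[(\tilde Z^{ij})^p] \preceq E[|\Delta_{ij}|^p] \, (\tilde A^{ij})^2 ,$$

where we used Jensen's inequality for $|E[\Delta_{ij}^p]| \le E[|\Delta_{ij}|^p]$.

Next, it remains to construct a bound on $E[|\Delta_{ij}|^p]$:

$$E[|\Delta_{ij}|^p] \le \frac{p!}{2} \Big( \frac{1}{\sqrt{k d_{\max}^2}} \Big)^p . \tag{30}$$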
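From (10), we have $\Delta_{ij} = P_{ij} - \tilde P_{ij} = \frac{1}{k d_{\max}} C_{ij}$. Therefore,

$$E[|\Delta_{ij}|^p] = \Big( \frac{1}{k d_{\max}} \Big)^p E[|C_{ij}|^p] .$$

Applying the Azuma-Hoeffding inequality to $C_{ij}$, we have that

$$P\Big( \Big| \frac{1}{k d_{\max}} C_{ij} \Big| \ge t \Big) \le 2 \exp(-2 t^2 d_{\max}^2 k) .$$

That is, $\frac{1}{k d_{\max}} C_{ij}$ is a sub-Gaussian random variable. And therefore, it follows that for $p \ge 2$,

$$E\Big[ \Big| \frac{1}{k d_{\max}} C_{ij} \Big|^p \Big] \le \frac{p!}{2} \Big( \frac{1}{\sqrt{k d_{\max}^2}} \Big)^p .$$

This proves the desired bound in (30).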

4.1.3 Proof of Lemma 4.3

By Lemma 4.2 and the bound $\tilde\pi_{\max}/\tilde\pi_{\min} \le b$, we have

$$1 - \rho \ge 1 - \lambda_{\max}(\tilde P) - \|\Delta\|_2 \sqrt{b} \ge 1 - \lambda_{\max}(\tilde P) - C \sqrt{\frac{b \log n}{k d_{\max}}} .$$

In this section we prove that there is a positive gap: $(d_{\min}/(2 b^2 d_{\max})) \, \xi$. We will first prove that

$$1 - \lambda_{\max}(\tilde P) \ge \frac{\xi \, d_{\min}}{b^2 d_{\max}} . \tag{31}$$

This implies that we have the desired eigengap for $k \ge 4C^2 b^5 d_{\max} \log n \,(1/(d_{\min}\xi))^2$ such that
$C\sqrt{b \log n/(k d_{\max})} \le (d_{\min}/(2 b^2 d_{\max})) \, \xi$.

To prove (31), we use comparison theorems [DSC93], which bound the spectral gap of the
Markov chain $\tilde P$ of interest using a few comparison inequalities related to a more tractable Markov
chain, which is the simple random walk on the graph. We define the transition matrix of the simple
random walk on the graph G as

$$Q_{ij} = \frac{1}{d_i} \quad \text{for } (i,j) \in E ,$$

and the stationary distribution of this Markov chain is $\mu(i) = d_i/\sum_j d_j$. Further, since the detailed
balance equation is satisfied, Q is a reversible Markov chain. Formally, $\mu(i) Q_{ij} = 1/\sum_\ell d_\ell =
\mu(j) Q_{ji}$ for all $(i,j) \in E$.

The following key lemma is a special case of a more general result [DSC93] proved for two 
arbitrary reversible Markov chains, which are not necessarily defined on the same graph. For 
completeness, we provide a proof of this lemma later in this section, following a technique similar 
to the one in [BGPS05] used to prove a similar result for a special case when the stationary 
distribution is uniform. 
Lemma 4.5. Let $Q, \mu$ and $\tilde P, \tilde\pi$ be reversible Markov chains on a finite set [n] representing
random walks on a graph $G = ([n], E)$, i.e. $\tilde P(i,j) = 0$ and $Q(i,j) = 0$ if $(i,j) \notin E$. For
$\alpha \equiv \min_{(i,j) \in E} \{\tilde\pi(i)\tilde P_{ij}/\mu(i)Q_{ij}\}$ and $\beta \equiv \max_i \{\tilde\pi(i)/\mu(i)\}$,

$$\frac{1 - \lambda_{\max}(\tilde P)}{1 - \lambda_{\max}(Q)} \ge \frac{\alpha}{\beta} . \tag{32}$$

By assumption, we have $\xi = 1 - \lambda_{\max}(Q)$. To prove that there is a positive spectral gap
for the random walk of interest as in (31), we are left to bound $\alpha$ and $\beta$. We have $\mu(i) Q_{ij} =
1/\sum_\ell d_\ell = 1/|E|$ and $\mu(i) = d_i/|E|$. Also, by the assumption that $\max_{i,j} w_i/w_j \le b$, we have
$\tilde\pi(i)\tilde P_{ij} = w_i w_j/(d_{\max}(w_i + w_j)\sum_\ell w_\ell) \ge 1/(b n d_{\max})$ and $\tilde\pi(i) = w_i/\sum_\ell w_\ell \le b/n$. Then, $\alpha =
\min_{(i,j) \in E} \{\tilde\pi(i)\tilde P_{ij}/\mu(i)Q_{ij}\} \ge |E|/(n b d_{\max})$ and $\beta = \max_i \{\tilde\pi(i)/\mu(i)\} \le b|E|/(n d_{\min})$. Hence, $\alpha/\beta \ge
d_{\min}/(d_{\max} b^2)$ and this finishes the proof of the bound in (31).

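4.1.4 Proof of Lemma 4.5

Since $1 - \lambda_{\max} = \min\{1 - \lambda_2, 1 + \lambda_n\}$, we will first show that $1 - \lambda_2(Q) \le (\beta/\alpha)(1 - \lambda_2(\tilde P))$ and
$1 + \lambda_n(Q) \le (\beta/\alpha)(1 + \lambda_n(\tilde P))$. The desired bound in (32) immediately follows from the fact that
$\min\{a, b\} \le \min\{a', b'\}$ if $a \le a'$ and $b \le b'$.

A reversible Markov chain Q is self-adjoint in $L^2(\mu)$. Then, the second largest eigenvalue $\lambda_2(Q)$
can be represented by the Dirichlet form $\mathcal{E}$ defined as

$$\mathcal{E}^Q(\phi, \phi) \equiv \langle (I - Q)\phi, \phi \rangle_\mu = \frac{1}{2} \sum_{i,j} (\phi(i) - \phi(j))^2 \mu(i) Q(i,j) .$$

For $\lambda_n(Q)$, we use

$$\mathcal{F}^Q(\phi, \phi) \equiv \langle (I + Q)\phi, \phi \rangle_\mu = \frac{1}{2} \sum_{i,j} (\phi(i) + \phi(j))^2 \mu(i) Q(i,j) .$$

Following the usual variational characterization of the eigenvalues (see, for instance, [HJ85], p. 176)
gives

$$1 - \lambda_2(Q) = \min_{\phi \perp \mathbb{1}} \frac{\mathcal{E}^Q(\phi, \phi)}{\langle \phi, \phi \rangle_\mu} , \tag{33}$$

$$1 + \lambda_n(Q) = \min_{\phi} \frac{\mathcal{F}^Q(\phi, \phi)}{\langle \phi, \phi \rangle_\mu} . \tag{34}$$

By the definitions of $\alpha$ and $\beta$, we have $\tilde\pi(i)\tilde P(i,j) \ge \alpha \mu(i) Q(i,j)$ and $\tilde\pi(i) \le \beta \mu(i)$ for all i and
j, which implies

$$\mathcal{E}^{\tilde P}(\phi, \phi) \ge \alpha \, \mathcal{E}^Q(\phi, \phi) , \quad \mathcal{F}^{\tilde P}(\phi, \phi) \ge \alpha \, \mathcal{F}^Q(\phi, \phi) , \quad \langle \phi, \phi \rangle_{\tilde\pi} \le \beta \langle \phi, \phi \rangle_\mu .$$

Together with (33) and (34), this implies $1 - \lambda_2(Q) \le (\beta/\alpha)(1 - \lambda_2(\tilde P))$ and $1 + \lambda_n(Q) \le (\beta/\alpha)(1 + \lambda_n(\tilde P))$.
This finishes the proof of the desired bound.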






4.2 Algorithm convergence for random sampling

Given the proof of Theorem 1 in the previous section, we only need to prove that for an Erdős-Rényi
graph with average degree $d \ge C \log n$ the following are true:

$$(1/2)d \le d_i \le (3/2)d , \tag{35}$$

$$1/2 \le \xi . \tag{36}$$

Then, it follows that $\kappa \equiv d_{\max}/d_{\min} \le 3$ and $(1/2)d \le d_{\max} \le (3/2)d$. By Theorem 1, it follows that

$$\frac{\|\pi - \tilde\pi\|}{\|\tilde\pi\|} \le C b^{5/2} \sqrt{\frac{\log n}{k d}} ,$$

for some positive constant C and for $k d \ge 288 C^2 b^5 \log n$ with probability at least $1 - n^{-C''}$.

We can apply standard concentration inequalities to establish equation (35). Applying Chernoff's
inequality, we get $P(|d_i - d| > (1/2)d) \le 2e^{-d/16}$. Hence, for $d \ge C' \log n$, equation (35) is true
with probability at least $1 - 2n^{-C'/16}$.
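
The degree condition (35) can be illustrated with a quick simulation. This is only a sketch under assumed parameters: the function name `er_degrees`, the values n = 300 and d = 100, and the fixed seed are choices made for the example, not values used in the paper.

```python
import random


def er_degrees(n, d, rng):
    """Sample G(n, d/n) and return the degree of every node."""
    p = d / n
    degrees = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degrees[i] += 1
                degrees[j] += 1
    return degrees


n, d = 300, 100
degrees = er_degrees(n, d, random.Random(0))
within_band = all(0.5 * d <= deg <= 1.5 * d for deg in degrees)
kappa = max(degrees) / min(degrees)  # ratio d_max / d_min
```

When every degree falls in the band $[(1/2)d, (3/2)d]$, the ratio `kappa` is automatically at most 3, which is the consequence used in the proof.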

Finally, we finish the proof with a lower bound on the spectral gap $\xi = 1 - \lambda_{\max}(D^{-1}B)$. To
establish this, we use celebrated results on the spectral gap of random graphs, first proved by Kahn
and Szemerédi for d-regular random graphs [FKS89], and later extended to Erdős-Rényi graphs
in [FO05]. Let B be the adjacency matrix of an Erdős-Rényi random graph G(n, d/n); then it is
shown in [FO05] that for $d \ge C \log n$,

$$\sigma_2(B) \le C' \sqrt{d} , \tag{37}$$

with probability at least $1 - n^{-C''}$, for some numerical constants $C, C', C''$, where $C''$ can be made
as large as we want by increasing C and $C'$. Since we are interested in the eigenvalues of $L = D^{-1}B$,
we define a more tractable matrix with the same set of eigenvalues: $\tilde L = D^{-1/2} B D^{-1/2}$. Because
$\tilde L$ is a symmetric matrix, the eigenvalues are the same as the singular values up to a sign. Let
$\sigma_1(\tilde L) \ge \sigma_2(\tilde L) \ge \ldots$ denote the ordered singular values of $\tilde L$. Then, it is enough to show that

$$\sigma_2(\tilde L) \le \frac{2\sigma_2(B)}{d} . \tag{38}$$

Then,

$$1 - \lambda_{\max}(L) \ge 1 - \frac{2C'}{\sqrt{d}} ,$$

and for $d \ge C \log n$, this can be made as close to one as we want by increasing C. This proves the
desired lower bound in (36).



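We are left to prove the bound in (38) using the variational representation of $\sigma_2(\tilde L)$:

$$\begin{aligned}
\sigma_2(\tilde L) &= \min_{H \in \mathcal{H}_2} \max_{x \in H} \frac{x^T D^{-1/2} B D^{-1/2} x}{x^T x} \\
&= \min_{H \in \mathcal{H}_2} \max_{y \in H} \frac{y^T B y}{y^T D y} \\
&\le \min_{H \in \mathcal{H}_2} \max_{y \in H} \frac{1}{d_{\min}} \frac{y^T B y}{y^T y} \\
&= \frac{\sigma_2(B)}{d_{\min}} ,
\end{aligned}$$

where $\mathcal{H}_2$ is the set of all 2-dimensional subspaces in $\mathbb{R}^n$ and $d_{\min}$ is the minimum degree. By
concentration of measure inequalities, we know that $d_{\min} \ge (1/2)d$.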

4.3 Proof of the information-theoretic lower bound in Theorem 3

In this section, we prove Theorem 3 using an information-theoretic method that allows us to reduce
the stochastic inference problem into a multi-way hypothesis testing problem.

This estimation problem can be reduced to the following hypothesis testing problem. Consider
a set $\{\tilde\pi^{(1)}, \ldots, \tilde\pi^{(M(\delta))}\}$ of $M(\delta)$ vectors on the standard orthogonal simplex which are separated by
$\delta$, such that $\|\tilde\pi^{(\ell_1)} - \tilde\pi^{(\ell_2)}\| \ge \delta$ for all $\ell_1 \ne \ell_2$. To simplify the notation, we use M as a
shorthand for $M(\delta)$. Suppose we choose an index $L \in \{1, \ldots, M\}$ uniformly at random. Then, we
are given noisy outcomes of pairwise comparisons with $w = \tilde\pi^{(L)}$ from the BTL model. We use X to
denote this set of observations. Let $\hat\pi$ be the estimate produced by an algorithm using the noisy
observations. Given this, the best estimate of the "index" is $\hat L$, where $\hat L = \arg\min_{\ell \in [M]} \|\hat\pi - \tilde\pi^{(\ell)}\|$.

By construction of our packing set, when we make a mistake in the hypothesis testing, our
estimate is at least $\delta/2$ away from the true weight $\tilde\pi^{(L)}$. Precisely, $\hat L \ne L$ implies that $\|\hat\pi - \tilde\pi^{(L)}\| \ge
\delta/2$. Then,

$$\begin{aligned}
E[\|\hat\pi - \tilde\pi^{(L)}\|] &\ge \frac{\delta}{2} \, P(\hat L \ne L) \\
&\ge \frac{\delta}{2} \Big\{ 1 - \frac{I(\hat L; L) + \log 2}{\log M} \Big\} ,
\end{aligned} \tag{39}$$

where $I(\cdot\,;\cdot)$ denotes the mutual information between two random variables and the second
inequality follows from Fano's inequality.

These random vectors form a Markov chain $L - \tilde\pi^{(L)} - X - \hat\pi - \hat L$, where $X - Y - Z$ indicates
that X and Z are conditionally independent given Y. Let $P_{L,X}(\ell, x)$ denote the joint probability
function, and $P_{X|L}(x|\ell)$, $P_L(\ell)$ and $P_X(x)$ denote the conditional and marginal probability functions.
Then, by the data processing inequality for a Markov chain, we get

$$\begin{aligned}
I(\hat L; L) &\le I(L; X) \\
&= E_{L,X}\Big[ \log \frac{P_{L,X}(L, X)}{P_L(L) P_X(X)} \Big] \\
&= \frac{1}{M} \sum_{\ell \in [M]} D_{KL}\big( P_{X|L}(X|\ell) \,\|\, P_X(X) \big) \\
&\le \frac{1}{M^2} \sum_{\ell_1, \ell_2 \in [M]} D_{KL}\big( P_{X|L}(X|\ell_1) \,\|\, P_{X|L}(X|\ell_2) \big) ,
\end{aligned} \tag{40}$$

where $D_{KL}(\cdot\|\cdot)$ is the Kullback-Leibler (KL) divergence and the last inequality follows from the
convexity of KL divergence and Jensen's inequality.

The KL divergence between the observations coming from two different BTL models depends
on how we sample the comparisons. We are sampling each pair of items for comparison with
probability d/n, and we are comparing each of these sampled pairs k times. Let $X_{ij}$ denote the
outcome of k comparisons for a sampled pair of items (i, j). To simplify notation, we drop the
subscript X|L whenever it is clear from the context. Then

$$\begin{aligned}
D_{KL}\big( P(X|\ell_1) \,\|\, P(X|\ell_2) \big) &= \frac{d}{n} \sum_{1 \le i < j \le n} D_{KL}\big( P(X_{ij}|\ell_1) \,\|\, P(X_{ij}|\ell_2) \big) \\
&\le 2 n^2 k d \, \|\tilde\pi^{(\ell_1)} - \tilde\pi^{(\ell_2)}\|^2 ,
\end{aligned} \tag{41}$$

where in the last inequality we used the fact that



< 



K ^) (#fl )_- f2 ) )2 + ^ 2 ) ( 



TT : 



vr; 



for independent trials of Bernoulli random variables, and $\tilde\pi_i^{(\ell)} \ge 1/(2n)$ for all i and $\ell$, which
follows from our construction of the packing set in Lemma 4.6 and our choice of $\delta$.

The remainder of the proof relies on the following key technical lemma, on the construction of
a suitable packing set that has a sufficient number of entries which are reasonably separated. This is
proved in Section 4.3.1.



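Lemma 4.6. For $n \ge 90$ and for any positive $\delta \le 1/(2\sqrt{10n})$, there exists a set of n-dimensional
vectors $\{\tilde\pi^{(1)}, \ldots, \tilde\pi^{(M)}\}$ with cardinality $M = e^{n/128}$ such that $\sum_i \tilde\pi_i^{(\ell)} = 1$ and

$$\frac{1 - 2\delta\sqrt{10n}}{n} \le \tilde\pi_i^{(\ell)} \le \frac{1 + 2\delta\sqrt{10n}}{n}$$

for all $i \in [n]$ and $\ell \in [M]$, and

$$\delta \le \|\tilde\pi^{(\ell_1)} - \tilde\pi^{(\ell_2)}\| \le \sqrt{13}\,\delta$$

for all $\ell_1 \ne \ell_2$.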



Substituting this bound in Eqs. (41), (40), and (39), we get

$$\begin{aligned}
\max_{\ell \in [M]} E[\|\hat\pi - \tilde\pi^{(\ell)}\|] &\ge E[\|\hat\pi - \tilde\pi^{(L)}\|] \\
&\ge \frac{\delta}{2} \Big\{ 1 - \frac{3328 \, n^2 k d \, \delta^2 + 128 \log 2}{n} \Big\} .
\end{aligned}$$

Choosing $\delta = (b - 1)/(30\sqrt{10}\,(b + 1)\sqrt{kdn})$, we know that $3328 \, n^2 k d \, \delta^2 + 128 \log 2 \le (1/2)n$ for all
b and all $n \ge 682$. This implies that

$$\max_{\ell \in [M]} E[\|\hat\pi - \tilde\pi^{(\ell)}\|] \ge \frac{b - 1}{120\,(b + 1)\sqrt{10 k d n}} .$$

From Lemma 4.6 it follows that $\|\tilde\pi^{(\ell)}\| \le 2/\sqrt{n}$ for all $\ell$. Then, scaling the bound by $1/\|\tilde\pi^{(\ell)}\|$, the
normalized minimax rate is lower bounded by $(b - 1)/(240(b + 1)\sqrt{10kd})$. Also, for this choice of
$\delta$, the dynamic range is at most b. From Lemma 4.6 the dynamic range is upper bounded by

$$\max_{i,j} \frac{\tilde\pi_i^{(\ell)}}{\tilde\pi_j^{(\ell)}} \le \frac{1 + 2\delta\sqrt{10n}}{1 - 2\delta\sqrt{10n}} .$$

This is monotonically increasing in $\delta$ for $\delta < 1/(2\sqrt{10n})$. Hence, for $\delta \le (b - 1)/((b + 1) 2\sqrt{10n})$,
which is always true for our choice of $\delta$, the dynamic range is upper bounded by b. This finishes
the proof of the desired bound on the normalized minimax error rate for general b.
4.3.1 Proof of Lemma 4.6

We show that a random construction succeeds in generating a set of M vectors on the standard
orthogonal simplex satisfying the conditions with a strictly positive probability. Let $M = e^{n/128}$
and for each $\ell \in [M]$, we construct independent random vectors $\tilde\pi^{(\ell)}$ according to the following
procedure. For a positive $\alpha$ to be specified later, we first draw n random variables uniformly
from $[(1 - \alpha\delta\sqrt{n})/n, (1 + \alpha\delta\sqrt{n})/n]$. Let $Y^{(\ell)} = [Y_1^{(\ell)}, \ldots, Y_n^{(\ell)}]$ denote this random vector in n
dimensions. Then we project this onto the n-dimensional simplex by setting

$$\tilde\pi^{(\ell)} = Y^{(\ell)} + (1/n - \bar Y^{(\ell)}) \mathbb{1} ,$$

where $\bar Y^{(\ell)} = (1/n) \sum_i Y_i^{(\ell)}$. By construction, the resulting vector is on the standard orthogonal
simplex: $\sum_i \tilde\pi_i^{(\ell)} = 1$. Also, applying Hoeffding's inequality for $\bar Y^{(\ell)}$, we get that

$$P\Big( \Big| \bar Y^{(\ell)} - \frac{1}{n} \Big| > \frac{\alpha\delta}{\sqrt{n}} \Big) \le 2 e^{-n/2} .$$

By the union bound, this holds uniformly for all $\ell$ with probability at least $1 - 2e^{-63n/128}$. In particular,
this implies that

$$\frac{1 - 2\alpha\delta\sqrt{n}}{n} \le \tilde\pi_i^{(\ell)} \le \frac{1 + 2\alpha\delta\sqrt{n}}{n} , \tag{42}$$

for all $i \in [n]$ and $\ell \in [M]$.

Next, we use standard concentration results to bound the distance between two vectors:

$$\|\tilde\pi^{(\ell_1)} - \tilde\pi^{(\ell_2)}\|^2 = \|Y^{(\ell_1)} - Y^{(\ell_2)}\|^2 - n (\bar Y^{(\ell_1)} - \bar Y^{(\ell_2)})^2 .$$

Applying Hoeffding's inequality for the first term, we get $P\big( \big| \sum_i (Y_i^{(\ell_1)} - Y_i^{(\ell_2)})^2 - (2/3)\alpha^2\delta^2 \big| >
(1/2)\alpha^2\delta^2 \big) \le 2e^{-n/32}$. Similarly for the second term, we can show that $P\big( \big| \sum_i (Y_i^{(\ell_1)} - Y_i^{(\ell_2)}) \big| \ge
(1/4)\alpha\delta\sqrt{n} \big) \le 2e^{-n/32}$. Substituting these bounds, we get

$$\frac{1}{10}\alpha^2\delta^2 \le \|\tilde\pi^{(\ell_1)} - \tilde\pi^{(\ell_2)}\|^2 \le \frac{13}{10}\alpha^2\delta^2 , \tag{43}$$

with probability at least $1 - 4e^{-n/32}$. Applying the union bound over $M^2 \le e^{n/64}$ pairs of vectors, we
get that the lower and upper bounds hold for all pairs $\ell_1 \ne \ell_2$ with probability at least $1 - 4e^{-n/64}$.
The probability that both conditions (42) and (43) are satisfied is at least $1 - 4e^{-n/64} - 2e^{-63n/128}$.

For $n \ge 90$, the probability of success is strictly positive. Hence, we know that there exists at least
one set of vectors that satisfies the conditions. Setting $\alpha = \sqrt{10}$, we have constructed a set that
satisfies all the conditions.
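
The random construction in this proof is simple enough to try out directly. The sketch below draws a handful of vectors and checks the three conditions of Lemma 4.6; the function name `packing_vector`, the values n = 200 and $\delta = 1/(4\sqrt{10n})$, the number of vectors, and the seed are all assumptions made for illustration.

```python
import math
import random


def packing_vector(n, delta, alpha, rng):
    """Draw Y uniformly around 1/n, then shift it onto the simplex."""
    half_width = alpha * delta * math.sqrt(n) / n
    y = [rng.uniform(1.0 / n - half_width, 1.0 / n + half_width) for _ in range(n)]
    y_bar = sum(y) / n
    return [y_i + (1.0 / n - y_bar) for y_i in y]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


n = 200
alpha = math.sqrt(10.0)
delta = 1.0 / (4.0 * math.sqrt(10.0 * n))
rng = random.Random(1)
vectors = [packing_vector(n, delta, alpha, rng) for _ in range(5)]

# Entry-wise bounds as in (42) and pairwise separation as in (43), with alpha^2 = 10.
lower = (1.0 - 2.0 * alpha * delta * math.sqrt(n)) / n
upper = (1.0 + 2.0 * alpha * delta * math.sqrt(n)) / n
pair_distances = [squared_distance(vectors[a], vectors[b])
                  for a in range(5) for b in range(a + 1, 5)]
```

The entry-wise bounds hold deterministically for this construction, because the shift onto the simplex is never larger than the half-width of the sampling interval. The separation bounds hold only with high probability, which is why the proof needs the union bound over pairs.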



4.4 Proof of Lemma 3.1

Without loss of generality, let us consider two items i and j such that $w_i > w_j$. When we estimate
a higher score for item j then we make a mistake in the ranking of these two items. When
this happens, such that $\pi_j - \pi_i > 0$, it naturally follows that $w_i - w_j \le w_i - w_j + \pi_j - \pi_i \le
|w_i - \pi_i| + |\pi_j - w_j|$. For a general pair i and j, we have that $(w_i - w_j)(\sigma_i - \sigma_j) > 0$ implies
$|w_i - w_j| \le |w_i - \pi_i| + |w_j - \pi_j|$. Substituting this into the definition of the weighted distance
$D_w(\cdot)$, and using the fact that $(a + b)^2 \le 2a^2 + 2b^2$, we get

$$D_w(\sigma) \le \Big\{ \frac{1}{\|w\|^2} \sum_{i=1}^{n} (w_i - \pi_i)^2 \Big\}^{1/2} .$$

This proves that the distance $D_w(\sigma)$ is upper bounded by the normalized Euclidean distance
$\|w - \pi\|/\|w\|$.

5 Discussion 

In this paper, we developed a novel iterative rank aggregation algorithm for discovering scores
of objects given pairwise comparisons. The algorithm has a natural random walk interpretation
over the graph of objects with edges present between two objects if they are compared; the scores
turn out to be the stationary probability of this random walk. In light of recent works on network
centrality, which are graph score functions primarily based on random walks, we call this algorithm
Rank Centrality. The algorithm is model independent.

We also established the efficacy of the algorithm by analyzing its performance when data is
generated as per the popular Bradley-Terry-Luce (BTL) model. We have obtained an analytic
bound on the finite sample error rates between the scores assumed by the BTL model and those
estimated by our algorithm. As shown, these lead to order-optimal dependence on the number
of samples required to learn the scores well by our algorithm under random selection of pairs for
comparison. The experimental evaluation shows that our (model independent) algorithm performs
as well as the Maximum Likelihood Estimator of the BTL model and outperforms other known
competitors, including the recently proposed algorithm by Ammar and Shah [AS11].

Given the simplicity of the algorithm, its analytic guarantees and the wide utility of the problem of
rank aggregation, we strongly believe that this algorithm will be of great practical value.